IBM Research TRECVID-2010 Video Copy Detection and Multimedia Event Detection System.
Matthew L. Hill, Gang Hua, Apostol Natsev, John R. Smith, Lexing Xie, Bert Huang, Michele Merler, Hua Ouyang, Mingyuan Zhou
2010-01-01
Affiliations: ∗IBM T. J. Watson Research Center, Hawthorne, NY, USA; †Dept. of Computer Science, Columbia University; ‡College of Computing, Georgia Tech; §Dept. of Electrical Engineering, Duke University
Abstract: In this paper, we describe the system jointly developed by IBM Research and Columbia University for video copy detection and multimedia event detection applied to the TRECVID-2010 video retrieval benchmark.
A. Content-Based Copy Detection: The focus of our copy detection system this year was fusing three types of complementary fingerprints: a keyframe-based color correlogram, a SIFTogram (bag of visual words), and a GIST-based fingerprint. However, in our official submissions, we did not use the color correlogram component, since our best results on the training set came from the GIST and SIFTogram components. A summary of our runs is listed below:
1. IBM.m.nofa.gistG: A run based on the grayscale GIST frame-level feature, with at most 1 result per query, except in the case of ties.
2. IBM.m.balanced.gistG: As in the above run, but including more results per query, though still fewer than 2 on average.
3. IBM.m.nofa.gistGC: The result of the nofa.gistG run, fused with results from GIST features extracted from the R, G, B color channels.
4. IBM.m.nofa.gistGCsift: The result of the nofa.gistGC run, fused with a SIFTogram result.
Overall, the grayscale GIST approach performed best. We found it produced excellent results when tested on the TRECVID-2009 data set, with an optimal NDCR that surpassed what we had achieved with SIFTogram previously. The “gistG” runs also outperformed our other runs on the 2010 data, although we changed our SIFT implementation this year, so the SIFTogram results are not directly comparable with our previous TRECVID results. Our system did not make use of any audio features.
B. Multimedia Event Detection: Our MED system has three aspects to its design: a variety of global, local, and spatial-temporal descriptors; detectors built from a large-scale semantic basis; and temporal motif features. A summary of our runs is listed below:
1. IBM-CU 2010 MED EVAL cComboAll 1: Combination of all classifiers.
2. IBM-CU 2010 MED EVAL pComboIBM+CUHOF 1: Combination of global image features, spatial-temporal interest points, audio features, and model vector classifiers.
3. IBM-CU 2010 MED EVAL cComboStatic 1: Combination of global image features and model vector classifiers.
4. IBM-CU 2010 MED EVAL cComboDynamic 1: Combination of spatial-temporal interest points, audio features, temporal motif, and HMM classifiers.
5. IBM-CU 2010 MED EVAL cComboIBM+CUHOF 2: Combination of global image features, spatial-temporal interest points, audio features, and model vector classifiers.
6. IBM-CU 2010 MED EVAL cComboIBM-HOF 1: Combination of global image features, spatial-temporal HOG points, and model vector classifiers.
7. IBM-CU 2010 MED EVAL cComboIBM 1: Combination of global image features, spatial-temporal interest points, and model vector classifiers.
8. IBM-CU 2010 MED EVAL cmodelVectorAvg 1: Run with 272 semantic model vector features.
9. IBM-CU 2010 MED EVAL cTemporalMotifs 1: Semantic model vector feature with sequential motifs.
10. IBM-CU 2010 MED EVAL cmvxhmm 1: Semantic model vector feature with hierarchical HMM state histograms.
Overall, the semantic model vector is our best-performing single feature, while the combination of dynamic features outperforms the static features, and temporal motif and hierarchical HMMs show promising performance.
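Both parts of the system combine the outputs of several components: fingerprint results in copy detection, and classifiers in event detection. The abstract does not state the fusion rule, so the following is only a minimal sketch of one common choice, assuming min-max normalization followed by a weighted average of per-video scores. The names `min_max_normalize` and `late_fusion`, the video IDs, and all score values are hypothetical and are not taken from the submitted runs.

```python
from typing import Dict, List

# Hypothetical representation: one component's output maps a video ID to a score.
Scores = Dict[str, float]


def min_max_normalize(scores: Scores) -> Scores:
    """Rescale one component's scores to [0, 1] so components are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal, so the component carries no ranking information.
        return {vid: 0.5 for vid in scores}
    return {vid: (s - lo) / (hi - lo) for vid, s in scores.items()}


def late_fusion(components: List[Scores], weights: List[float]) -> Scores:
    """Weighted average of normalized scores (assumed rule, not the paper's).

    A video missing from a component contributes 0 for that component.
    """
    if len(components) != len(weights):
        raise ValueError("need exactly one weight per component")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    normalized = [min_max_normalize(c) for c in components]
    all_ids = set().union(*[n.keys() for n in normalized])
    return {
        vid: sum(w * n.get(vid, 0.0) for w, n in zip(weights, normalized)) / total
        for vid in all_ids
    }


# Illustrative scores only: a "static" and a "dynamic" component on different scales.
static_scores = {"v1": 0.2, "v2": 0.9, "v3": 0.4}
dynamic_scores = {"v1": 3.0, "v2": 1.0, "v3": 5.0}

fused = late_fusion([static_scores, dynamic_scores], weights=[1.0, 1.0])
ranked = sorted(fused, key=fused.get, reverse=True)  # ['v3', 'v2', 'v1']
```

Normalizing before averaging keeps a component with a large score range from dominating the result. In practice the weights would be tuned on training data rather than set equal.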