BBN VISER TRECVID 2013 Multimedia Event Detection and Multimedia Event Recounting Systems.

Pradeep Natarajan, Shuang Wu, Florian Luisier, Xiaodan Zhuang, Manasvi Tickoo, Guangnan Ye, Dong Liu, Shih-Fu Chang, Imran Saleemi, Mubarak Shah, Vlad I. Morariu, Larry Davis, Abhinav Gupta, Ismail Haritaoglu, Sadiye Guler, Ashutosh Morde
2013-01-01
Abstract: We describe the VISER system, led by Raytheon BBN Technologies (BBN), for the TRECVID 2013 Multimedia Event Detection (MED) and Multimedia Event Recounting (MER) tasks. We present a comprehensive analysis of its modules: (1) a large suite of visual, audio and multimodal low-level features; (2) video- and segment-level semantic scene/action/object concepts; (3) automatic speech recognition (ASR); (4) videotext detection and recognition (OCR). For the low-level features, we used multiple static, motion-based, color, and audio features with a Fisher Vector (FV) representation. For the semantic concepts, we developed various visual concept sets in addition to multiple existing visual concept banks. In particular, we used BBN's natural language processing (NLP) technologies to automatically identify and train salient concepts from short textual descriptions of research set videos. We also exploited online data resources to augment the concept banks. For the speech and videotext content, we leveraged rich confidence-weighted keywords and phrases obtained from the ASR and OCR systems. We combined these streams using multiple early (feature-level) and late (score-level) fusion strategies. Our system uses both SVM-based and query-based detection to achieve superior performance despite the varying number of positive videos in the event kit. We present a thorough comparison of systems based on semantic features with systems based on low-level features. Consistent with previous MED evaluations, low-level features still exhibit strong performance. Further, our semantic-feature-based systems have improved significantly and produce gains in fusion, especially in the EK10 and EK0 conditions. In the pre-specified condition, the mean average precision (MAP) of our VISER system is 33%, 16.6% and 5.2% for the EK100, EK10 and EK0 conditions, respectively. These results are largely consistent with our ad hoc results of 32.2%, 14.3% and 8.1% for the EK100, EK10 and EK0 conditions, respectively. For the MER task, our system achieves an accuracy of 64.96%, and evaluators need only 52.83% of the video length to analyze the evidence and make their judgment.
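As an illustration of two terms used in the abstract, the following minimal Python sketch shows one common form of late (score-level) fusion and the usual non-interpolated definition of mean average precision. It is not the VISER implementation: the abstract does not specify the fusion rule or the exact scoring procedure, so the weighted-average fusion and the function names `late_fusion`, `average_precision` and `mean_average_precision` are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple


def late_fusion(stream_scores: Sequence[Sequence[float]],
                weights: Sequence[float]) -> List[float]:
    """Fuse per-video scores from several streams by weighted averaging.

    Assumption: a plain weighted average; the abstract only says that
    "multiple" late fusion strategies were used.
    """
    total = sum(weights)
    n_videos = len(stream_scores[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, stream_scores)) / total
        for i in range(n_videos)
    ]


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Non-interpolated AP: mean of precision at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(
        events: Sequence[Tuple[Sequence[float], Sequence[int]]]) -> float:
    """MAP: the average of per-event AP values."""
    return sum(average_precision(s, y) for s, y in events) / len(events)


if __name__ == "__main__":
    # Two hypothetical streams (e.g. a low-level and a semantic detector)
    # scoring the same four videos for one event.
    low_level = [0.9, 0.2, 0.6, 0.1]
    semantic = [0.7, 0.4, 0.8, 0.3]
    labels = [1, 0, 1, 0]

    fused = late_fusion([low_level, semantic], weights=[0.5, 0.5])
    print(average_precision(fused, labels))
```

Weighted averaging is only one option; rank-based or learned fusion would slot into the same place, with the fused scores then ranked per event to compute AP.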