Informedia @TRECVID 2012.
Shoou-I Yu, Zhongwen Xu, Duo Ding, Waito Sze, Francisco Vicente, Zhenzhong Lan, Yang Cai, Shourabh Rawat, Peter F. Schulam, Nisarga Markandaiah, Sohail Bahmani, Antonio Juárez, Wei Tong, Yi Yang, Susanne Burger, Florian Metze, Rita Singh, Bhiksha Raj, Richard M. Stern, Teruko Mitamura, Eric Nyberg, Lu Jiang, Qiang Chen, Lisa M. Brown, Ankur Datta, Quanfu Fan, Rogério Schmidt Feris, Shuicheng Yan, Alexander G. Hauptmann, Sharath Pankanti
2012-01-01
Abstract: We report on our system used in the TRECVID 2012 Multimedia Event Detection (MED) and Multimedia Event Recounting (MER) tasks. For MED, it consists of three main steps: extracting features, training detectors and fusion. In the feature extraction part, we extract many low-level, high-level, and text features. Those features are then represented in three different ways: spatial bag-of-words with standard tiling, spatial bag-of-words with feature and event specific tiling, and the Gaussian Mixture Model Super Vector. In detector training and fusion, two classifiers and three fusion methods are employed. The results from both the official sources and our internal evaluations show good performance of our system. Our MER system utilizes a subset of the features and detection results from the MED system, from which the recounting is generated.

1. MED System

1.1 Features

In order to encompass all aspects of a video, we extracted a wide variety of low-level and high-level features. Table 1 summarizes the features used in our system. Most of them are widely used in the community, for example SIFT, STIP and MFCC. We extracted those features using standard code available from the authors with default parameters.

Table 1: Features used for MED’12 system

Low-level features
  Visual Features:
    1. SIFT (Sande, Gevers, & Snoek, 2010)
    2. Color SIFT (CSIFT) (Sande, Gevers, & Snoek, 2010)
    3. Motion SIFT (MoSIFT) (Chen & Hauptmann, 2009)
    4. Transformed Color Histogram (TCH) (Sande, Gevers, & Snoek, 2010)
    5. STIP (Wang, Ullah, Klaser, Laptev, & Schmid, 2009)
    6. Dense Trajectory (Wang, Klaser, Schmid, & Liu, 2011)
  Audio Features:
    1. MFCC
    2. Acoustic Unit Descriptors (AUDs) (Chaudhuri, Harvilla, & Raj, 2011)
High-level features
  Visual Features:
    1. Semantic Indexing Concepts (SIN) (Over, et al., 2012)
    2. Object Bank (Li, Su, Xing, & Fei-Fei, 2010)
  Audio Features:
    1. Acoustic Scene Analysis
Text Features
  Visual Features:
    1. Optical Character Recognition
  Audio Features:
    1. Automatic Speech Recognition

Besides those common features, we have two home-grown features: Motion SIFT (MoSIFT) and Acoustic Unit Descriptors (AUDs). We introduce these two features in the following subsections.

1.1.1 Motion SIFT (MoSIFT) Feature

The goal of developing the MoSIFT feature is to combine features from the spatial domain and the temporal domain. Local spatio-temporal features around interest points provide compact and descriptive representations for video analysis and motion recognition. Current approaches tend to extend spatial descriptions by adding a temporal component to the appearance descriptor, which only implicitly captures motion information. MoSIFT detects interest points and not only encodes their local appearance but also explicitly models local motion. The idea is to detect distinctive local features through local appearance and motion. Figure 1 demonstrates the MoSIFT algorithm.

Figure 1: System flow chart of the MoSIFT algorithm.

The algorithm takes a pair of video frames to find spatio-temporal interest points at multiple scales. Two major computations are applied: SIFT point detection and optical flow computation according to the scale of the SIFT points. For the descriptor, MoSIFT adapts the idea of grid aggregation in SIFT to describe motions. Optical flow detects the magnitude and direction of a movement. Thus, optical flow has the same properties as appearance gradients. The same aggregation can be applied to optical flow in the neighborhood of interest points to increase robustness to occlusion and deformation. The two aggregated histograms (appearance and optical flow) are combined into the MoSIFT descriptor, which has 256 dimensions.

1.1.2 Acoustic Unit Descriptors (AUDs)

We have developed an unsupervised lexicon learning algorithm that automatically learns units of sound. Each unit spans a set of audio frames, thereby taking local acoustic context into account.
Using a maximum-likelihood estimation process, we can learn a set of such acoustic units from audio data without supervision. Each of these units can be thought of as a low-level fundamental unit of sound, and each audio frame is generated by these units. We refer to these units as Acoustic Unit Descriptors (AUDs), and we expect that the distribution of these units will carry information about the semantic content of the audio stream. Each AUD is represented by a 5-state Hidden Markov Model (HMM) with a 4-Gaussian mixture output density function. Ideally, with a perfect learning process, we would like to learn semantically interpretable lower-level units, such as a clap, a thud, a bang, etc. Naturally, it is hard to enforce semantic interpretability on the audio learning process at that level of detail. Further, because the space of all possible sounds is so large and we can only learn a finite set of units, many different sounds will be mapped onto the same unit at learning time.

1.2 Feature Representations

In the previous section, we briefly described the features used in the system. In this section, we describe the representations used for the raw features extracted in Section 1.1. Three representations were used in our system: the K-means based spatial bag-of-words model with standard tiling (Lazebnik, Schmid, & Ponce, 2006), the K-means based spatial bag-of-words model with feature and event specific tiling (Viitaniemi & Laaksonen, 2009), and the Gaussian Mixture Model Super Vector (Campbell & Sturim, 2006). Since the K-means based spatial bag-of-words model with standard tiling and the Gaussian Mixture Model Super Vector are standard technology, we focus on the K-means based spatial bag-of-words model with feature and event specific tiling. For simplicity, we will refer to it as tiling.

The spatial bag-of-words model is a widely used representation of low-level image/video features.
The central idea of the spatial bag-of-words model is to divide the image into small tiles and compute a bag-of-words histogram for each tile. Figure 2 shows a couple of tiling examples.

Figure 2: Examples of tiling

In general, the standard spatial bag-of-words representation uses the 1x1, 2x2 and 4x4 tilings. However, this choice of tilings is ad hoc, and some preliminary work has shown that other tilings might produce better performance (Viitaniemi & Laaksonen, 2009). In our system, we systematically tested 80 different tilings to select the best one for each feature and each event. Table 2 shows the performance of feature specific tiling vs. the standard tiling. The scores are computed from our internal experiments and are averaged over 20 MED12 pre-specified events. The PMiss @ TER=12.5 metric is an official evaluation metric specified in the MED 2012 Evaluation Plan. A smaller PMiss score signifies better performance. From the table, we can see that for all five features, feature specific tiling consistently outperforms the standard tiling, lowering PMiss by between 0.0056 (STIP) and 0.0138 (TCH), which is at least 1% in relative terms.

Table 2: The performance of feature specific tiling and standard tiling

Feature                  SIFT    CSIFT   TCH     STIP    MOSIFT
Feature Specific Tiling  0.4209  0.4496  0.4914  0.5178  0.4330
Standard Tiling          0.4325  0.4618  0.5052  0.5234  0.4456

Figure 3 shows an example of the performance of event specific tiling vs. standard tiling on Event 25 (marriage proposal), which is a difficult event identified in our experiments. It can be seen clearly that event specific tiling noticeably improves the performance over standard tiling.

Figure 3: The comparison of event specific tiling and standard tiling on Event 25

1.3 Training and Fusion

We used the standard MED’12 training dataset for our internal evaluation and for training the models for our submission.
For our internal evaluation, the MED’12 training dataset was further divided into a training set and a testing set by randomly assigning half of the positive examples to the training set and the other half to the testing set. The negative examples consisted of only NULL videos, which do not have label information.

The two classifiers used in the system were kernel SVM and kernelized ridge regression. For simplicity, we will refer to the latter as kernel regression. For the K-means based feature representations, we used the Chi-squared kernel. For the GMM based representation, the RBF kernel was used. The parameters of the model were tuned by 5-fold cross validation, with PMiss @ TER = 12.5 as the evaluation metric.

For combining features from multiple modalities and the outputs of different classifiers, we used fusion and ensemble methods. More specifically, for the same classifier, we used three fusion methods to fuse different features: early fusion, late fusion and double fusion (Lan, Bao, Yu, Liu, & Hauptmann, 2012). In early fusion, the kernel matrices from different features were normalized first and then combined. In late fusion, the prediction scores from the models trained using different features were combined. Double fusion combines early fusion and late fusion. Finally, the results from different classifiers were ensembled together. Figure 4 shows the diagram of our system.

Figure 4: The diagram of the system

[Figure 3 plot: PMiss@12.5 (y-axis, 0.55 to 0.75) for CSIFT, SIFT, MOSIFT, STIP and TCH on E025 Marriage_proposal, with a "baseline" legend entry]

1.4 Submission

In the following section we describe in detail the runs we submitted to NIST. Table 3 shows the official performance of each submission.
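
Because the submitted runs are assembled from the fusion schemes of Section 1.3, the following minimal Python sketch illustrates early, late and double fusion on toy inputs. It is not the code used in our system: the exponential form of the Chi-squared kernel, the normalization of each kernel matrix by its mean entry, the uniform averaging weights and all function names are assumptions made for illustration only.

```python
import math


def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-squared kernel between two non-negative histograms.

    The exponential form and the gamma parameter are assumptions; the text
    only states that a Chi-squared kernel was used.
    """
    dist = sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)
    return math.exp(-gamma * dist)


def kernel_matrix(rows, cols, kernel):
    """Evaluate a kernel between every pair of (row, column) samples."""
    return [[kernel(r, c) for c in cols] for r in rows]


def normalize_kernel(K):
    """Divide a kernel matrix by its mean entry (assumed normalization).

    For a test-vs-train kernel, the scale estimated on the training kernel
    would normally be reused; that detail is omitted here.
    """
    mean = sum(map(sum, K)) / (len(K) * len(K[0]))
    return [[v / mean for v in row] for row in K]


def early_fusion(kernels):
    """Normalize one kernel matrix per feature, then average them entrywise."""
    normed = [normalize_kernel(K) for K in kernels]
    n_rows, n_cols = len(normed[0]), len(normed[0][0])
    return [
        [sum(K[i][j] for K in normed) / len(normed) for j in range(n_cols)]
        for i in range(n_rows)
    ]


def late_fusion(score_lists):
    """Average the per-video prediction scores of several models."""
    return [sum(scores) / len(scores) for scores in zip(*score_lists)]


def double_fusion(single_feature_scores, early_fused_scores):
    """Late-fuse single-feature scores together with the scores of a model
    trained on the early-fused kernel (assumed reading of double fusion)."""
    return late_fusion(single_feature_scores + [early_fused_scores])
```

In such a sketch, the matrix returned by `early_fusion` would be passed to a classifier that accepts a precomputed kernel, and the score lists given to `late_fusion` would come from the per-feature SVM or kernel regression models. Weighted averaging could replace the uniform mean if per-feature weights were tuned by cross validation.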

1.4.1 Pre-Specified Submission

1.4.1.1 Submission 1: CMU_MED12_MED12TEST_PS_MEDFull_EKFull_AutoEAG_p_ensembleKRSVM_1

For this submission, using the features described in the previous section, we generated the run as follows:
1. For each feature, train an SVM classifier and a kernel regression model.
2. Late fusion of all the results from the SVM classifiers and the kernel regression models, respectively.
3. Early fusion of all features except ASR.
4. Train an SVM classifier and a kernel