Informedia @ TRECVID 2011: Multimedia Event Detection, Semantic Indexing
Lei Bao, Longfei Zhang, Shoou-I Yu, Zhen-zhong Lan, Lu Jiang, Arnold Overwijk, Qin Jin, Shohei Takahashi, Brian Langner, Yuanpeng Li, Michael Garbus, Susanne Burger, Florian Metze, Alexander Hauptmann
2012-01-01
Abstract: We report on our results in the TRECVID 2011 Multimedia Event Detection (MED) and Semantic Indexing (SIN) tasks. Both tasks consist of three main steps: extracting features, training detectors, and fusing. For feature extraction, we extracted many low-level features, high-level features and text features. We used the Spatial-Pyramid Matching technique to represent low-level local visual features such as SIFT and MoSIFT, so that the representation captures the location of feature points. For detector training, besides the traditional SVM, we proposed a Sequential Boosting SVM classifier to deal with the large-scale unbalanced classification problem. For fusion, to take advantage of the different features, we tried three fusion methods: early fusion, late fusion and double fusion, where double fusion is a combination of early fusion and late fusion. The experimental results demonstrated that double fusion is consistently better than, or at least comparable to, early fusion and late fusion.

1 Multimedia Event Detection (MED)

1.1 Feature Extraction

In order to encompass all aspects of a video, we extracted a wide variety of visual and audio features, as shown in Table 1.

Table 1: Features used for the MED task.

Low-level Features
• Visual: SIFT [19], Color SIFT [19], Transformed Color Histogram [19], Motion SIFT [3], STIP [9]
• Audio: Mel-Frequency Cepstral Coefficients

High-level Features
• Visual: PittPatt Face Detection [12], Semantic Indexing Concepts [15]
• Audio: Acoustic Scene Analysis

Text Features
• Visual: Optical Character Recognition
• Audio: Automatic Speech Recognition

1.1.1 SIFT, Color SIFT (CSIFT), Transformed Color Histogram (TCH)

These three features describe the gradient and color information of a static image. We used the Harris-Laplace detector for corner detection. For more details, please see [19].
Instead of extracting features from all frames of all videos, we first ran shot-break detection and extracted features only from the keyframe of each shot. The shot-break detection algorithm compares the color histograms of adjacent frames and declares a shot boundary when the histogram difference is larger than a threshold. For the 16,507 training videos, we extracted 572,881 keyframes. For the 32,061 testing videos, we extracted 1,035,412 keyframes.

Once we have the keyframes, we extract the three features as in [19]. From the raw feature files, a 4096-word codebook is learned using the K-Means clustering algorithm. Given the codebook and a region in an image, we can create a 4096-dimensional bag-of-words vector representing that region. Using the Spatial-Pyramid Matching [10] technique, we extract 8 regions from a keyframe image and calculate a bag-of-words vector for each region. In the end, we get an 8 × 4096 = 32768-dimensional bag-of-words vector. The 8 regions are calculated as follows.
• The whole image as one region.
• Split the image into 4 quadrants; each quadrant is a region.
• Split the image horizontally into 3 equally sized rectangles; each rectangle is a region.

Since each feature vector describes only one keyframe, and a video is described by many keyframes, we compute a vector representing the whole video by averaging the feature vectors of its keyframes. The resulting features are then provided to a classifier.

1.1.2 Motion SIFT (MoSIFT)

Motion SIFT [3] is a motion-based feature that combines information from SIFT and optical flow. The algorithm first extracts SIFT points and, for each SIFT point, checks whether there is a large enough optical flow near the point. If the optical flow value is larger than a threshold, a 256-dimensional feature is computed for that point.
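The selection rule just described might be sketched as follows. This is an illustrative sketch only, not the MoSIFT implementation from [3]: the function name `select_moving_points`, the dense flow-field layout, and reading the flow at the keypoint's own pixel (rather than in a neighborhood around it) are all assumptions made for the example.

```python
import math


def select_moving_points(keypoints, flow, threshold):
    """Keep keypoints whose optical-flow magnitude exceeds `threshold`.

    keypoints: list of (x, y) integer pixel coordinates (e.g. SIFT points).
    flow:      flow[y][x] = (dx, dy), the displacement at that pixel
               between two neighboring frames (assumed layout).
    """
    kept = []
    for x, y in keypoints:
        dx, dy = flow[y][x]
        # Assumption: the "optical flow value" is the flow vector's magnitude.
        if math.hypot(dx, dy) > threshold:
            kept.append((x, y))
    return kept
```

Only the points returned by such a filter would go on to receive a descriptor; static SIFT points are discarded.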
The first 128 dimensions of the feature vector are the SIFT descriptor, and the latter 128 dimensions describe the optical flow near the point. We extracted Motion SIFT by calculating the optical flow between neighboring frames but, due to speed issues, we only extracted Motion SIFT for every third frame. Once we have the raw features, a 4096-word codebook is computed and, using the same process as for SIFT, a 32768-dimensional vector is created for classification.

1.1.3 Space-Time Interest Points (STIP)

Space-Time Interest Points are computed as in [9]. Given the raw features, a 4096-word codebook is computed and, using the same process as for SIFT, a 32768-dimensional vector is created for classification.

1.1.4 Semantic Indexing (SIN)

We applied the detectors for the 346 semantic concepts from the Semantic Indexing 2011 task to the MED keyframes. For details on how we created the models for the 346 concepts, please refer to section 2. Once we have the prediction score of each concept on each keyframe, we compute a 346-dimensional feature that represents a video. The value of each dimension is the mean of that concept's prediction scores over all keyframes in the video. We tried different score merging techniques, including mean and max, and mean had the best performance. These features are then provided to a classifier.
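The representations in this section share two steps: building an 8-region bag-of-words vector per keyframe (section 1.1.1) and merging per-keyframe vectors into one video-level vector (sections 1.1.1 and 1.1.4). The following is a minimal sketch of both, not the code used in our system. The function names are hypothetical, the input is assumed to be feature points already assigned to their nearest codeword, "horizontal split" is read as three stacked bands, and histograms are left as raw counts because no normalization is specified above.

```python
def pyramid_regions(x, y, width, height):
    """Return the indices (0..7) of the regions that contain point (x, y).

    0: whole image; 1-4: the four quadrants; 5-7: three horizontal bands
    (assumed to be stacked top to bottom).
    """
    col = min(int(2 * x / width), 1)
    row = min(int(2 * y / height), 1)
    band = min(int(3 * y / height), 2)
    return [0, 1 + 2 * row + col, 5 + band]


def keyframe_vector(points, width, height, vocab_size):
    """Concatenate one bag-of-words histogram per region.

    points: list of (x, y, word), where `word` is the index of the
            codeword nearest to the local descriptor at (x, y).
    """
    vec = [0.0] * (8 * vocab_size)
    for x, y, word in points:
        for region in pyramid_regions(x, y, width, height):
            vec[region * vocab_size + word] += 1.0
    return vec


def pool_keyframes(vectors, mode="mean"):
    """Merge per-keyframe vectors into one video-level vector."""
    columns = zip(*vectors)
    if mode == "mean":
        return [sum(c) / len(vectors) for c in columns]
    if mode == "max":
        return [max(c) for c in columns]
    raise ValueError("unknown pooling mode: " + mode)
```

With `vocab_size=4096`, `keyframe_vector` returns a vector of length 8 × 4096 = 32768. Calling `pool_keyframes` with `mode="mean"` corresponds to the averaging used for the bag-of-words vectors and for the 346 concept scores, while `mode="max"` is the alternative merging strategy mentioned in section 1.1.4.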