BUPT-MCPRL at TRECVID 2013
Xin Guo,Yuanbo Chen,W. Liu,Yuanhui Mao,Han Zhang,Kang Zhou,Lingxi Wang,Hua Yan,Zhicheng Zhao,Yanyun Zhao,A. Cai
2013-01-01
Abstract: In this paper, we describe the BUPT-MCPRL systems for TRECVID [6] 2013. Our team participated in two tasks: automatic instance search and surveillance event detection. Each is introduced briefly below.

A. Automatic instance search

We divide the topics into two kinds, object and person, according to the query description, and handle each kind differently. However, because of errors in key-frame extraction, our runs obtain low infAP scores.

Table 1. INS results and descriptions for each run

| Run ID | infAP | Description |
|---|---|---|
| F_X_NO_BUPT.MCPRL_2 | 0.014 | BoW scheme |
| F_X_NO_BUPT.MCPRL_3 | 0.019 | BoW scheme with global feature |

B. Surveillance event detection

This year, we focus on six events: ObjectPut, PersonRuns, Pointing, PeopleMeet, PeopleSplitUp, and Embrace. These events are divided into three groups, and our system applies a different detection algorithm to each group.

1 Automatic instance search

1.1 Object retrieval

We adopt both local and global features. For the local feature, we first detect key-points with the Hessian-affine detector and describe them with SIFT, and then generate a 63k generic codebook by approximate k-means clustering [1] on training images crawled randomly from Flickr [2]. Each descriptor is then assigned to the closest cluster center in feature space, and all local features of an image are aggregated following the BoW scheme. For the global feature, we use a 512-dimensional HSV correlation histogram to describe the global color distribution of the image.

Each topic provides 4 query images with their corresponding masks. Visual words and the HSV correlation histogram are first extracted from each query image. Then, for each query image $i$, we have two vocabulary sets, $Q_i = \{q_1, \dots, q_a\}$ and $M_i = \{m_1, \dots, m_b\}$, where $Q_i$ is an unordered set of visual words from the whole image and $M_i$ is an unordered set containing only the visual words from the ROI defined by the mask.
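To make the two vocabulary sets concrete, the following is a minimal sketch of how descriptors could be quantized against a codebook and collected into a whole-image set and an ROI set. It is an illustration under stated assumptions, not the system's implementation: the function names (`nearest_word`, `visual_word_sets`), the brute-force nearest-center search, and the toy 2-D descriptors and 3-word codebook are all hypothetical stand-ins for SIFT descriptors, approximate search, and the 63k codebook.

```python
from math import dist

def nearest_word(descriptor, codebook):
    """Return the index of the codebook center closest to the descriptor."""
    return min(range(len(codebook)), key=lambda k: dist(descriptor, codebook[k]))

def visual_word_sets(keypoints, descriptors, codebook, mask):
    """Return (whole_image_words, roi_words) as unordered sets of word ids.

    keypoints   -- list of (x, y) integer pixel positions
    descriptors -- list of feature vectors, one per key-point
    mask        -- 2-D list of 0/1 values, mask[y][x] == 1 inside the ROI
    """
    whole, roi = set(), set()
    for (x, y), descriptor in zip(keypoints, descriptors):
        word = nearest_word(descriptor, codebook)
        whole.add(word)            # presence only; occurrence counts are discarded
        if mask[y][x]:
            roi.add(word)          # word also belongs to the masked region
    return whole, roi

# Toy data (assumed for illustration only).
codebook = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
keypoints = [(0, 0), (1, 0), (1, 1), (0, 1)]
descriptors = [(0.5, 0.2), (9.0, 1.0), (9.5, -0.5), (1.0, 8.0)]
mask = [[0, 1],
        [0, 1]]                    # ROI is the right-hand column (x == 1)

whole_words, roi_words = visual_word_sets(keypoints, descriptors, codebook, mask)
```

Using Python sets mirrors the choice to keep only the presence of a visual word; a histogram-based BoW vector would need a counter instead.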
Note that both the whole-image set $Q_i$ and the ROI set $M_i$ of a query image are unordered sets, which means we only record whether a visual word appears in the image and ignore how many times it appears, because the aggregated feature is sparse. We then take the union of the ROI sets over the query images of a topic, $M = M_1 \cup M_2 \cup \dots \cup M_4$, as the vocabulary set that needs special attention.

For each reference image, we likewise obtain an unordered set of visual words $R$ and an HSV correlation histogram as its features. The idf coefficient of each visual word $j$ is calculated from the reference set as $\mathrm{idf}_j = \log(N / n_j)$, where $N$ is the total number of images and $n_j$ is the number of images in which visual word $j$ appears.

After the features of both query and reference images are extracted, the similarity between each query image and each reference image is calculated as follows:

1) Find the intersection between query $Q_i$ and reference $R$: $I = Q_i \cap R$;
2) Set a weight $w_j$ for each $j \in I$: if $j \in M$, a high weight is set (in our experiment, $w_j = 3$); otherwise, $w_j = \mathrm{idf}_j$;
3) The similarity between query $Q_i$ and reference $R$ is $S(Q_i, R) = \sum_{j \in I} w_j / (|Q_i| + |R|)$.

After step 3, we have a similarity score for each reference image. RANSAC is then used to verify the spatial relationship between each query image and reference image. We combine the number of inliers $n_{\mathrm{in}}$ with the similarity score computed above to obtain the final score of each image: $F = (\alpha \cdot S) \cdot (\beta \cdot n_{\mathrm{in}})$. In our experiment, we set $\alpha = 1$ and $\beta = 10$. In addition, the similarity of the HSV correlation histograms of each query and reference image is calculated in terms of the Mahalanobis distance. The scores of the local and global features are fused by linear combination.

1.2 Person retrieval

In the off-line phase, a face detector finds the face regions in each reference image, and a 3200-dimensional LBP feature is extracted to describe each detected face. In the on-line phase, a similar process is applied to each query image, except that the face detector is run only inside the ROI defined by the given mask rather than on the whole image.
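For the object-retrieval scoring in Section 1.1, steps 1) to 3) and the fusion with the RANSAC inlier count can be sketched as follows. This is only one reading of the description and rests on assumptions: the idf is taken to be the usual logarithm of (number of images / number of images containing the word), the weight rule gives a fixed weight of 3 to words in the ROI union and the idf value otherwise, and all function and parameter names (`idf_table`, `similarity`, `final_score`, `roi_weight`, `a`, `b`) are hypothetical. The inlier count is passed in as a number; RANSAC itself is not implemented here.

```python
from math import log

def idf_table(reference_sets):
    """idf[w] = log(N / n_w), with n_w the number of reference images containing w."""
    n_images = len(reference_sets)
    counts = {}
    for words in reference_sets:
        for w in words:
            counts[w] = counts.get(w, 0) + 1
    return {w: log(n_images / c) for w, c in counts.items()}

def similarity(query_words, reference_words, roi_union, idf, roi_weight=3.0):
    """Weighted overlap of two visual-word sets, normalized by their sizes."""
    common = query_words & reference_words                 # step 1: intersection
    total = sum(roi_weight if w in roi_union else idf.get(w, 0.0)
                for w in common)                           # step 2: per-word weights
    return total / (len(query_words) + len(reference_words))  # step 3: normalize

def final_score(sim, num_inliers, a=1.0, b=10.0):
    """Product of the scaled similarity and the scaled inlier count."""
    return (a * sim) * (b * num_inliers)

# Toy data (assumed for illustration only).
reference_sets = [{1, 2, 3}, {2, 3, 4}, {3, 5}, {6}]
idf = idf_table(reference_sets)
query_words = {1, 2, 7}
roi_union = {2}                    # union of the ROI word sets of the topic

sim = similarity(query_words, reference_sets[0], roi_union, idf)
score = final_score(sim, num_inliers=5)
```

Because the final score is a product, a reference image with no RANSAC inliers receives a score of zero regardless of its BoW similarity.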
Then, the Euclidean distance is used to measure the distance between the LBP features of each query and reference face region. Since each topic contains four query images, the distance between a reference image and the topic is taken as its shortest distance to any of the query face regions. Ranking the reference images by this distance gives the rank list for each topic.
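The ranking rule above can be sketched in a few lines. This is a minimal illustration under assumptions rather than the system's code: face features are short toy vectors instead of 3200-dimensional LBP descriptors, a reference image with several detected faces is scored by its closest face, images with no detected face are left out of the list, and the names `topic_distance`, `rank_references`, and the image ids are hypothetical.

```python
from math import dist

def topic_distance(reference_faces, query_faces):
    """Smallest Euclidean distance between any reference face and any query face."""
    return min(dist(r, q) for r in reference_faces for q in query_faces)

def rank_references(references, query_faces):
    """Return image ids sorted by increasing distance to the topic.

    references -- dict mapping image id to a list of face feature vectors
    """
    scored = [(topic_distance(faces, query_faces), image_id)
              for image_id, faces in references.items()
              if faces]                      # skip images without detected faces
    return [image_id for _, image_id in sorted(scored)]

# Toy data (assumed for illustration only).
query_faces = [(0.0, 0.0), (1.0, 1.0)]
references = {
    "shot_a": [(5.0, 5.0)],
    "shot_b": [(1.0, 1.5), (9.0, 9.0)],
    "shot_c": [],
}

ranking = rank_references(references, query_faces)
```

Taking the minimum over query faces means a reference image only needs to match one of the query examples well to rank highly.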