Combining Orientational Pooling Features for Scene Recognition
Lingxi Xie, Jingdong Wang, Baining Guo, Bo Zhang, Qi Tian
2014-01-01
Abstract: Scene recognition is a basic task towards image understanding. Spatial Pyramid Matching (SPM) has been shown to be an efficient solution for spatial context modeling. In this paper, we introduce an alternative approach, Orientational Pyramid Matching (OPM), for orientational context modeling. Our approach is motivated by the observation that the 3D orientations of objects are a crucial factor in discriminating indoor scenes. The novelty is that OPM uses 3D orientations to form the pyramid and produce the pooling regions, whereas SPM uses spatial positions. Experimental results on challenging scene classification tasks show that OPM achieves performance comparable to SPM, and that OPM and SPM make complementary contributions, so that their combination gives state-of-the-art performance.

1. The Bag-of-Features Model

The BoF model is composed of three basic stages: local descriptor extraction, feature encoding, and spatial pooling.

The local descriptor extraction stage usually extracts a set of local descriptors, e.g., SIFT [8] or HOG [2], from the interest points or densely-sampled patches of an image.

The feature encoding stage then assigns each descriptor to the closest entry in a visual vocabulary: a codebook learned offline by clustering a large set of descriptors with the K-Means or Gaussian Mixture Model (GMM) algorithm. Feature encoding can also be sparse [13] or high-dimensional [9].

Spatial pooling consists of partitioning an image into a set of regions, aggregating feature-level statistics over these regions [18], and normalizing and then concatenating the region descriptors into an image-level feature vector [16]. The image partition can be obtained by Spatial Pyramid Matching (SPM) [7]. Aggregation of descriptors within a region is often performed with a pooling strategy.

2. Our Approach

In this section, we first introduce the proposed Orientational Pyramid Matching model, and then present the algorithm for estimating the 3D orientations of image patches.

2.1. Orientational Pyramid Matching

Given a set of patch descriptors extracted from interest points or densely-sampled regions, the goal is to summarize them into an image-level feature vector. Unlike Spatial Pyramid Matching (SPM), in which each patch descriptor is associated with its spatial position, our approach augments the patch descriptor $f$ with an additional 3D orientation, denoted by the azimuth and polar angles $o = (\theta, \phi)$. We denote the set of encoded local features as $S = \{(f_1, o_1), (f_2, o_2), \ldots, (f_M, o_M)\}$.

The proposed Orientational Pyramid Matching (OPM) algorithm starts by partitioning the set $S$ into subsets $\{S_t\}$, $t = 1, 2, \ldots, T_O$, where each subset consists of the patch descriptors that are close in orientational angles, rather than in the spatial positions used by SPM. The partition can be done in various ways, such as clustering the angles. In this paper, we follow a simple scheme similar to SPM and perform a regular partition, i.e., we divide the orientational space $U = [-\frac{\pi}{2}, \frac{\pi}{2}]^2$ into regular grids, which is shown to perform well in practice.

Let $L_A$ and $L_P$ be the numbers of pyramid layers along the azimuth and polar angles, respectively. A bin in the $l$-th layer is then of size $\frac{\pi}{2^{\min\{l, L_A\}}} \times \frac{\pi}{2^{\min\{l, L_P\}}}$ along the azimuth and polar angles, i.e., the number of orientational pooling bins in the $l$-th layer is $2^{\min\{l, L_A\}} \times 2^{\min\{l, L_P\}}$.

Denote the set of regions produced by the orientational pyramid by $R_1, R_2, \ldots, R_{T_O}$. Each region $R_t$ contains a set of $M_t$ patch descriptors $\{f_{t,1}, f_{t,2}, \ldots, f_{t,M_t}\}$. We aggregate the $M_t$ features to generate a descriptor $f_t$ for region $R_t$. The overall image feature is then obtained by concatenating the pooled feature vectors of all the regions.
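The regular orientational partition and region-wise pooling described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function name `opm_pool`, the layer indexing $l = 0, \ldots, \max(L_A, L_P)$, the use of max pooling, the per-region L2 normalization, and the all-zero descriptor for empty bins are all assumptions made for illustration. The text above fixes only the bin counts per layer and the final concatenation.

```python
import numpy as np


def opm_pool(features, orientations, L_A=1, L_P=1):
    """Pool encoded features over a regular orientational pyramid (illustrative sketch).

    features:     (M, D) array of encoded patch descriptors f_i.
    orientations: (M, 2) array of (azimuth, polar) angles o_i in [-pi/2, pi/2].
    Returns the concatenation of one pooled D-dimensional vector per region.
    """
    features = np.asarray(features, dtype=float)
    orientations = np.asarray(orientations, dtype=float)
    num_dims = features.shape[1]
    pooled = []

    # Assumption: layers are indexed l = 0, ..., max(L_A, L_P).
    for l in range(max(L_A, L_P) + 1):
        n_a = 2 ** min(l, L_A)  # number of bins along the azimuth angle
        n_p = 2 ** min(l, L_P)  # number of bins along the polar angle

        # Map angles from [-pi/2, pi/2] to bin indices; clipping puts the
        # upper boundary pi/2 into the last bin.
        a_pos = (orientations[:, 0] + np.pi / 2) / np.pi * n_a
        p_pos = (orientations[:, 1] + np.pi / 2) / np.pi * n_p
        a_idx = np.clip(np.floor(a_pos).astype(int), 0, n_a - 1)
        p_idx = np.clip(np.floor(p_pos).astype(int), 0, n_p - 1)

        for a in range(n_a):
            for p in range(n_p):
                mask = (a_idx == a) & (p_idx == p)
                if mask.any():
                    # Assumption: max pooling aggregates the region.
                    region = features[mask].max(axis=0)
                else:
                    # Assumption: an empty bin contributes a zero vector.
                    region = np.zeros(num_dims)
                norm = np.linalg.norm(region)
                if norm > 0:
                    # Assumption: each region descriptor is L2-normalized.
                    region = region / norm
                pooled.append(region)

    return np.concatenate(pooled)
```

Under these assumptions, setting `L_A = L_P = 1` yields 1 + 4 = 5 regions, so an input with D-dimensional encoded features produces a 5D-dimensional image-level vector. Replacing the angle pair with normalized (x, y) patch positions would give the analogous spatial pyramid.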