Supplemental Material: Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks

Sean Bell, C Lawrence Zitnick, Kavita Bala, Ross Girshick
Abstract: In this section, we describe our submission to the 2015 MS COCO Detection Challenge, which won Best Student Entry and finished 3rd overall, with a score of 31.0% mAP on 2015 test-challenge and 31.2% on 2015 test-dev. Later in this section we describe further post-competition improvements that achieve 33.1% on test-dev. Both models use a single ConvNet (no ensembling).

For our challenge submission, we made several improvements: we used a mix of MCG (Multiscale Combinatorial Grouping [3]) and RPN (Region Proposal Net [4]) proposal boxes, added two extra 512x3x3 convolutional layers, trained for longer, and used two rounds of bounding box regression with a modified version of weighted voting [1]. At test time, our model runs in 2.7 seconds/image on a single Titan X GPU (excluding proposal generation). We describe all changes in more detail below (note that many of these choices were driven by the need to meet the challenge deadline, and thus may be suboptimal):

1. Train+val. For the competition, we train on both the train and validation sets. We hold out 5000 images from the validation set as our new validation set, called “minival.”

2. MCG+RPN box proposals. We get the largest improvement by replacing selective search with a mix of MCG [3] and RPN [4] boxes. We modify RPN from the baseline configuration described in [4] by adding more anchor boxes, in particular smaller ones, and using a mixture of 3x3 (384) and 5x5 (128) convolutions. Our anchor configuration uses a total of 22 anchors per location with the following shapes: 32x32, and aspect ratios {1:2, 1:1, 2:1} × scales {64, 90.5, 128, 181, 256, 362, 512}. We also …
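
The anchor count in item 2 can be checked with a short Python sketch: one 32x32 anchor plus 3 aspect ratios × 7 scales gives 22 shapes. This is an illustration only, not the authors' code. The function name `make_anchors` is hypothetical, and two conventions are assumed rather than stated in the text: each ratio is read as width:height, and each scaled anchor keeps an area of scale².

```python
import math

def make_anchors(scales=(64, 90.5, 128, 181, 256, 362, 512),
                 ratios=((1, 2), (1, 1), (2, 1)),
                 extra=((32, 32),)):
    """Enumerate (width, height) anchor shapes for one spatial location.

    Assumptions (not stated in the source text): each ratio is width:height,
    and an anchor at scale s keeps area s**2 regardless of its ratio.
    """
    # Start with the single fixed-size anchor (32x32).
    anchors = [(float(w), float(h)) for (w, h) in extra]
    for s in scales:
        for (rw, rh) in ratios:
            r = rw / rh                # width / height
            w = s * math.sqrt(r)       # widen by sqrt(r) ...
            h = s / math.sqrt(r)       # ... and shrink height so w * h == s**2
            anchors.append((w, h))
    return anchors

if __name__ == "__main__":
    shapes = make_anchors()
    print(len(shapes))  # 1 + 3 * 7 = 22
    for w, h in shapes:
        print(f"{w:7.1f} x {h:7.1f}")
```

Under these assumptions the scales listed in the text step by a factor of roughly √2, so each step about doubles the anchor area.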