Large-Scale 3D Scene Classification With Multi-View Volumetric CNN

Dror Aiger, Brett Allen, Aleksey Golovinskiy
DOI: https://doi.org/10.48550/arXiv.1712.09216
2017-12-26
Abstract: We introduce a method to classify imagery using a convolutional neural network (CNN) on multi-view image projections. The power of our method comes from using projections of multiple images at multiple depth planes near the reconstructed surface. This enables classification of categories whose salient aspect is appearance change under different viewpoints, such as water, trees, and other materials with complex reflection/light response properties. Our method does not require boundary labelling in images and works on pixel-level classification with a small (few pixels) context, which simplifies the creation of a training set. We demonstrate this application on large-scale aerial imagery collections, and extend the per-pixel classification to robustly create a consistent 2D classification which can be used to fill the gaps in non-reconstructible water regions. We also apply our method to classify tree regions. In both cases, the training data can quickly be generated using a small number of manually-created polygons on a map. We show that even with a very simple and standard network our CNN outperforms the state-of-the-art image classification, the Inception-V3 model retrained from a large collection of aerial images.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to use multi-view image projections to improve classification accuracy in large-scale 3D scenes, especially for categories such as water and trees whose appearance changes substantially with the viewing angle. It proposes a convolutional neural network (CNN) that takes as input the projections of multiple images onto multiple depth planes. The method is designed to meet the following challenges:

1. **Scale of large-scale classification systems**: The system must generalize well, so that the training set does not have to grow significantly as the area of use expands.
2. **Creation and management of training sets**: To keep training sets easy to create, the system relies on a small number of sparse pixel labels rather than large amounts of manually labeled data.
3. **Robustness to reconstruction and photo-consistency**: The system must keep working when the underlying stereo pipeline changes.
4. **High precision**: Even a small error rate can produce a large number of visible artifacts, so the system must achieve very high classification accuracy.

The paper focuses on two application scenarios:

- **Water classification**: Water is difficult to handle in stereo reconstruction because it is often moving and has specular highlights, which tend to produce misleading photo-consistency maxima. The proposed method identifies these regions and fills in the resulting errors, which yields a more accurate water classification.
- **Tree classification**: Trees are common objects, and classifying them is useful in many applications, from visualization to modeling to geographic information systems (GIS).
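
The abstract states that training data can be generated from a small number of manually created polygons on a map. The following is a minimal sketch of how such polygons could be turned into sparse per-pixel labels. It is an illustration under stated assumptions, not the authors' tooling: the function names `points_in_polygon` and `label_pixels`, the use of `-1` for unlabeled pixels, and the choice of an even-odd ray-casting test are all hypothetical.

```python
import numpy as np

def points_in_polygon(points, polygon):
    """Even-odd ray-casting test.

    points:  (N, 2) array of (x, y) query points.
    polygon: (M, 2) array of (x, y) vertices in order (either winding).
    Returns a boolean array of length N.
    """
    points = np.asarray(points, dtype=float)
    polygon = np.asarray(polygon, dtype=float)
    x, y = points[:, 0], points[:, 1]
    inside = np.zeros(len(points), dtype=bool)
    x0, y0 = polygon[-1]
    for x1, y1 in polygon:
        # True where the edge crosses the horizontal line through the point.
        straddles = (y0 > y) != (y1 > y)
        # x-coordinate of that crossing; horizontal edges never straddle,
        # so the division-by-zero results they produce are masked out.
        with np.errstate(divide="ignore", invalid="ignore"):
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
        inside ^= straddles & (x < x_cross)
        x0, y0 = x1, y1
    return inside

def label_pixels(shape, polygons):
    """Build a label map for a raster of the given (height, width).

    polygons: list of (class_id, vertices) pairs in pixel coordinates.
    Pixels outside every polygon keep the label -1 (assumed "unlabeled").
    """
    h, w = shape
    rows, cols = np.mgrid[0:h, 0:w]
    # Test pixel centers, i.e. (col + 0.5, row + 0.5).
    centers = np.stack([cols.ravel() + 0.5, rows.ravel() + 0.5], axis=1)
    labels = np.full(shape, -1, dtype=int)
    for class_id, vertices in polygons:
        mask = points_in_polygon(centers, vertices).reshape(shape)
        labels[mask] = class_id
    return labels
```

Only the labeled pixels would be used as training samples; everything marked `-1` is simply ignored, which is what makes a handful of coarse polygons sufficient in this sketch.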
The method overcomes the limitations of single-image classification by letting the network exploit the interaction between features from multiple views. The paper demonstrates the approach on large-scale aerial image collections and extends the per-pixel classification to a consistent 2D classification, which is used to fill holes in water regions that could not be reconstructed. It also shows experimentally that, on the water classification task, the method outperforms an existing image-based classifier, the Inception-V3 model retrained on a large collection of aerial images.
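
To make the idea of "projections of multiple images at multiple depth planes" more concrete, here is a hedged sketch of how a multi-view, multi-depth input volume could be assembled for one surface point. It is not the authors' implementation. It assumes pinhole cameras given as 3x4 projection matrices, depth planes parallel to the ground at height offsets around the surface point, and nearest-neighbor sampling; the names `project` and `multiview_volume` and all parameters are hypothetical.

```python
import numpy as np

def project(P, pts):
    """Project (N, 3) world points with a 3x4 camera matrix to (N, 2) pixels."""
    homo = np.hstack([pts, np.ones((len(pts), 1))])
    uvw = homo @ P.T
    return uvw[:, :2] / uvw[:, 2:3]

def multiview_volume(images, cameras, center, depth_offsets, patch=5, spacing=1.0):
    """Sample a small patch around `center` from every view at every depth plane.

    images:        list of (H, W, C) arrays.
    cameras:       list of 3x4 projection matrices, one per image.
    center:        (x, y, z) world position of the surface point.
    depth_offsets: height offsets of the sampling planes relative to center z.
    Returns an array of shape (depths, views, patch, patch, C).
    Samples that fall outside an image are left at zero. Points behind a
    camera are not handled in this sketch.
    """
    half = (patch - 1) / 2.0
    offs = (np.arange(patch) - half) * spacing
    gx, gy = np.meshgrid(offs, offs)  # gx varies along columns, gy along rows
    channels = images[0].shape[2]
    volume = np.zeros(
        (len(depth_offsets), len(images), patch, patch, channels), dtype=np.float32
    )
    for d, dz in enumerate(depth_offsets):
        # Patch grid on the plane z = center_z + dz.
        pts = np.stack(
            [center[0] + gx.ravel(),
             center[1] + gy.ravel(),
             np.full(gx.size, center[2] + dz)],
            axis=1,
        )
        for v, (img, P) in enumerate(zip(images, cameras)):
            uv = np.rint(project(P, pts)).astype(int)
            h, w = img.shape[:2]
            ok = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
            samples = np.zeros((len(pts), channels), dtype=np.float32)
            samples[ok] = img[uv[ok, 1], uv[ok, 0]]
            volume[d, v] = samples.reshape(patch, patch, channels)
    return volume
```

In this sketch, a view with parallax samples different image locations at each depth plane, while the samples from different views agree best on the plane that coincides with the true surface. A classifier fed the whole volume can therefore see both how consistent the views are and how appearance changes with viewpoint, which is the cue the paper relies on for materials such as water.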