Abstract: Deep networks should be robust to rare events if they are to be successfully deployed in high-stakes real-world applications (e.g., self-driving cars). Here we study the capability of deep networks to recognize objects in unusual poses. We create a synthetic dataset of images of objects in unusual orientations, and evaluate the robustness of a collection of 38 recent and competitive deep networks for image classification. We show that classifying these images is still a challenge for all networks tested, with an average accuracy drop of 29.5% compared to when the objects are presented upright. This brittleness is largely unaffected by various network design choices, such as training losses (e.g., supervised vs. self-supervised), architectures (e.g., convolutional networks vs. transformers), dataset modalities (e.g., images vs. image-text pairs), and data-augmentation schemes. However, networks trained on very large datasets substantially outperform others, with the best network tested, Noisy Student EfficientNet-L2 trained on JFT-300M, showing a relatively small accuracy drop of only 14.5% on unusual poses. Nevertheless, a visual inspection of the failures of Noisy Student reveals a remaining gap in robustness relative to the human visual system. Furthermore, combining multiple object transformations (3D rotations and scaling) further degrades the performance of all networks. Altogether, our results provide another measurement of the robustness of deep networks that is important to consider when using them in the real world. Code and datasets are available at https://github.com/amro-kamal/ObjectPose.

Analysing the Effects of Pooling Combinations on Invariance to Position and Deformation in Convolutional Neural Networks

Pooling is neither necessary nor sufficient for appropriate deformation stability in CNNs

OVPT: Optimal Viewset Pooling Transformer for 3D Object Recognition

Wasserstein Pooling for Image Classification

On the Shift Invariance of Max Pooling Feature Maps in Convolutional Neural Networks

Progress and limitations of deep networks to recognize objects in unusual poses

Inductive Bias of Deep Convolutional Networks through Pooling Geometry

Adaptive Salience Preserving Pooling for Deep Convolutional Neural Networks

Cross-convolutional-layer Pooling for Generic Visual Recognition

Deep CNNs Meet Global Covariance Pooling: Better Representation and Generalization

Quantifying Translation-Invariance in Convolutional Neural Networks

Generalizing Pooling Functions in Convolutional Neural Networks: Mixed, Gated, and Tree

Exploring Novel Pooling Strategies for Edge Preserved Feature Maps in Convolutional Neural Networks

Gradient Corner Pooling for Keypoint-Based Object Detection

Equivariant vs. Invariant Layers: A Comparison of Backbone and Pooling for Point Cloud Classification

Effect of pooling strategy on convolutional neural network for classification of hyperspectral remote sensing images

Maximal Independent Sets for Pooling in Graph Neural Networks

Balanced Mixture of SuperNets for Learning the CNN Pooling Architecture

Object Level Deep Feature Pooling for Compact Image Representation

TI-POOLING: transformation-invariant pooling for feature learning in Convolutional Neural Networks

Untangling Local and Global Deformations in Deep Convolutional Networks for Image Classification and Sliding Window Detection