Abstract: Developing robust and interpretable vision systems is a crucial step towards trustworthy artificial intelligence. In this regard, a promising paradigm considers embedding task-required invariant structures, e.g., geometric invariance, in the fundamental image representation. However, such invariant representations typically exhibit limited discriminability, which restricts their application in larger-scale trustworthy vision tasks. To address this open problem, we conduct a systematic investigation of hierarchical invariance, exploring this topic from theoretical, practical, and application perspectives. At the theoretical level, we show how to construct over-complete invariants with a Convolutional Neural Network (CNN)-like hierarchical architecture, yet in a fully interpretable manner. The general blueprint, specific definitions, invariant properties, and numerical implementations are provided. At the practical level, we discuss how to customize this theoretical framework for a given task. Thanks to the over-completeness, discriminative features w.r.t. the task can be adaptively formed in a Neural Architecture Search (NAS)-like manner. We demonstrate the above arguments with accuracy, invariance, and efficiency results on texture, digit, and parasite classification experiments. Furthermore, at the application level, our representations are explored in real-world forensics tasks on adversarial perturbations and Artificial Intelligence Generated Content (AIGC). Such applications reveal that the proposed strategy not only realizes the theoretically promised invariance, but also exhibits competitive discriminability even in the era of deep learning. For robust and interpretable vision tasks at larger scales, hierarchical invariant representation can be considered an effective alternative to traditional CNNs and invariants.

What Does CNN Shift Invariance Look Like? A Visualization Study

Inability of spatial transformations of CNN feature maps to support invariant recognition

RC-CNN: Representation-Consistent Convolutional Neural Networks for Achieving Transformation Invariance

On the Shift Invariance of Max Pooling Feature Maps in Convolutional Neural Networks

Transform-Invariant Convolutional Neural Networks for Image Classification and Search

Shift Equivariance in Object Detection

Invariant Feature Extraction for Image Classification Via Multi-Channel Convolutional Neural Network

Group Invariant Deep Representations for Image Instance Retrieval

Hierarchical Invariance for Robust and Interpretable Vision Tasks at Larger Scales

Improving Shift Invariance in Convolutional Neural Networks with Translation Invariant Polyphase Sampling

On the ability of CNNs to extract color invariant intensity based features for image classification

Shift Invariance Can Reduce Adversarial Robustness

Quantifying Translation-Invariance in Convolutional Neural Networks

Visual Orientation Inhomogeneity Based Convolutional Neural Networks

CNN Architectures for Geometric Transformation-Invariant Feature Representation in Computer Vision: A Review

Contrastive Identification of Covariate Shift in Image Data

Learning Geometric Invariance Features and Discrimination Representation for Image Classification via Spatial Transform Network and XGBoost Modeling

CNNComparator: Comparative Analytics of Convolutional Neural Networks

Understanding image representations by measuring their equivariance and equivalence

Which Part of a Picture is Worth a Thousand Words: A Joint Framework for Finding and Visualizing Critical Linear Features from Images