Abstract: It is well observed that in the deep learning and computer vision literature, visual data are always represented in a manually designed coding scheme (e.g., RGB images are represented as integers ranging from 0 to 255 for each channel) when they are input to an end-to-end deep neural network (DNN) for any learning task. We boldly question whether the manually designed inputs are good for DNN training for different tasks and study whether the input to a DNN can be optimally learned end-to-end together with learning the weights of the DNN. In this paper, we propose the paradigm of *deep collective learning*, which aims to learn the weights of DNNs and the inputs to DNNs simultaneously for given tasks. We note that collective learning has been implicitly but widely used in natural language processing, while it has almost never been studied in computer vision. Consequently, we propose the lookup vision networks (Lookup-VNets) as a solution to deep collective learning in computer vision. This is achieved by associating each color in each channel with a vector in lookup tables. As learning inputs in computer vision has almost never been studied in the existing literature, we explore several aspects of this question through a variety of experiments on image classification tasks. Experimental results on four benchmark datasets, i.e., CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet (ILSVRC2012), have shown several surprising characteristics of Lookup-VNets and have demonstrated the advantages and promise of Lookup-VNets and deep collective learning.

What problem does this paper attempt to address?

The problem this paper attempts to solve is whether, in deep neural networks (DNNs), the input data (such as images) can be optimally learned end-to-end together with the network weights. Specifically, the authors question whether existing manually designed input representations (such as the integer representation of RGB images) are suitable for DNN training across different tasks, and propose a new paradigm, Deep Collective Learning, which aims to learn the weights and the inputs of DNNs simultaneously.

### Main Problems and Motivations

1. **Limitations of Existing Input Representations**:
   - In computer vision, image pixels are usually represented as integers (for example, 8-bit integers in RGB space, ranging from 0 to 255). These integers are normalized before being fed into the DNN.
   - This manually designed input representation assumes that changes in color correspond to linear changes in the input values, but this assumption may not always be optimal.
2. **Exploring the Possibility of Automatically Learning Inputs**:
   - The authors ask whether input representations in computer vision can be learned automatically instead of relying on manually designed schemes.
   - Although collective learning has been used implicitly in natural language processing (NLP) (for example, characters or words are associated with vectors), it has not been fully studied in computer vision.

### Solutions

To achieve deep collective learning, the authors propose "Lookup-VNets" (lookup vision networks). Lookup-VNets parameterize the input by associating each color in each channel with a vector. Specifically:

- **Full Lookup Tables**: Each color is mapped to a different vector, maintaining the complete color space (256×256×256).
- **Compressed Lookup Tables**: The size of the color space is reduced from 256×256×256 to ⌈256/c⌉×⌈256/c⌉×⌈256/c⌉ by the compression rate (CMP-Rate) c.

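The lookup-table idea above can be illustrated with a minimal sketch. It is not the authors' implementation: the per-channel table layout follows the abstract's wording, but the function names (`make_tables`, `lookup_pixel`, `color_space_size`), the random initialization, the vector dimension `dim`, and the integer-division bucketing `value // c` are all assumptions made for illustration. Only the forward lookup is shown; in the paper the table entries are learned jointly with the network weights.

```python
import math
import random


def make_tables(c=1, dim=3, channels=3, seed=0):
    """Build one table per channel with ceil(256 / c) rows of `dim` floats.

    Random initialization is an assumption; the real tables would be trained.
    """
    rng = random.Random(seed)
    rows = math.ceil(256 / c)
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(rows)]
        for _ in range(channels)
    ]


def lookup_pixel(tables, pixel, c=1):
    """Replace each 8-bit channel value by its table vector and concatenate."""
    out = []
    for table, value in zip(tables, pixel):
        # Assumed bucketing: c adjacent intensities share one table row.
        out.extend(table[value // c])
    return out


def color_space_size(c):
    """Number of distinct colors left after compression with rate c."""
    return math.ceil(256 / c) ** 3


tables = make_tables(c=16, dim=4)
vec = lookup_pixel(tables, (255, 128, 0), c=16)
print(len(tables[0]), len(vec))
print(color_space_size(1) // color_space_size(16))
```

Under these assumptions, c = 16 leaves ⌈256/16⌉ = 16 rows per channel, so the color space shrinks from 256³ to 16³ = 4096 colors, a 4096-fold reduction. Pixels whose channel values fall in the same bucket (for example 0 and 15 when c = 16) receive identical vectors.
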
### Experimental Verification

The authors conducted extensive experiments on four benchmark datasets (CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet) to explore the following questions:

- **Q1: What is the accuracy gap between Lookup-VNets and standard DNNs?**
- **Q2: How does the vector dimension in the lookup table affect the performance of Lookup-VNets?**
- **Q3: How does the compression rate (CMP-Rate) affect the performance of Lookup-VNets?**
- **Q4: How does the learning strategy of the lookup table affect the performance of Lookup-VNets?**
- **Q5: What do the images look like when they are represented by the learned lookup table?**

### Experimental Results

- **Influence of Vector Dimension**: Experiments show that the vector dimension has almost no influence on the performance of Lookup-VNets, indicating that it is not a key factor in determining performance.
- **Influence of Compression Rate**: On CIFAR-10, CIFAR-100, and Tiny ImageNet, performance does not decline significantly even when the color space is compressed by a factor of 4096. On the large-scale and challenging ImageNet dataset, Lookup-VNets perform better than standard DNNs.
- **Overall Performance**: The performance of Lookup-VNets on multiple datasets is comparable to that of standard DNNs, and it is even better on large-scale datasets (such as ImageNet).

### Conclusions

This paper explores the possibility of automatically learning input representations in computer vision by proposing deep collective learning and Lookup-VNets. The experimental results show that Lookup-VNets can not only match the performance of standard DNNs but also show advantages on large-scale and complex tasks.

Deep Collective Learning: Learning Optimal Inputs and Weights Jointly in Deep Neural Networks
