Abstract: One major challenge in training Deep Neural Networks is preventing overfitting. Many techniques such as data augmentation and novel regularizers such as Dropout have been proposed to prevent overfitting without requiring a massive amount of training data. In this work, we propose a new regularizer called DeCov which leads to significantly reduced overfitting (as indicated by the difference between train and val performance), and better generalization. Our regularizer encourages diverse or non-redundant representations in Deep Neural Networks by minimizing the cross-covariance of hidden activations. This simple intuition has been explored in a number of past works but surprisingly has never been applied as a regularizer in supervised learning. Experiments across a range of datasets and network architectures show that this loss always reduces overfitting while almost always maintaining or increasing generalization performance and often improving performance over Dropout.
What problem does this paper attempt to address?
The main problem this paper addresses is **preventing overfitting in deep neural networks (DNNs)**. Specifically, the authors propose a new regularizer, the **DeCov loss**, which reduces redundant representations by minimizing the cross-covariance between hidden-layer activations, thereby improving the model's ability to generalize.
### Problem Background
A common challenge when training deep neural networks is **preventing overfitting**. Even with a large amount of labeled data, deep networks remain prone to it. Overfitting shows up as a model that performs well on the training set but poorly on the validation or test set. The problem is especially pronounced in new domains and tasks, where a large amount of data usually has to be collected and labeled again.
### Existing Solutions
Several techniques are currently used to prevent overfitting, such as:
- **Data Augmentation**: Improve the robustness of the model by generating more diverse training data.
- **Dropout**: Randomly drop a subset of neurons during training to prevent co-adaptation, where multiple hidden units rely on each other to perform a function and therefore become highly correlated.
- Traditional regularization methods such as **L2 regularization** and **Lasso** (L1 regularization).
These methods are effective, but there is still room for improvement. In particular, the authors observe that high correlation between hidden-layer activations is associated with overfitting, and they propose to improve generalization further by explicitly reducing this correlation.
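As a point of comparison, the Dropout mechanism listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the paper: the function name `inverted_dropout`, the `drop_prob` parameter, and the rescaling of kept units by `1 / keep_prob` (the common "inverted dropout" convention) are assumptions made for the example.

```python
import numpy as np


def inverted_dropout(h, drop_prob, rng):
    """Randomly zero hidden activations during training.

    h:         array of hidden activations, any shape
    drop_prob: probability of dropping each unit, in [0, 1)
    rng:       a numpy.random.Generator
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    # Each unit is kept independently with probability keep_prob.
    mask = rng.random(h.shape) < keep_prob
    # Rescale the kept units so the expected activation is unchanged
    # (an assumed convention, not something stated in the summary).
    return h * mask / keep_prob


rng = np.random.default_rng(0)
activations = np.ones((4, 3))
dropped = inverted_dropout(activations, drop_prob=0.5, rng=rng)
```

Dropout reduces co-adaptation indirectly, through random masking. DeCov, by contrast, penalizes correlated activations directly through an explicit loss term.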
### Proposed New Method
The authors introduce the **DeCov loss**, whose core idea is to encourage diverse, non-redundant representations by minimizing the Frobenius norm of the cross-covariance matrix of hidden-layer activations, excluding its diagonal. The covariance between the activations of hidden units \(i\) and \(j\) is computed as:
\[
C_{i,j}=\frac{1}{N}\sum_{n}(h_{n,i}-\mu_{i})(h_{n,j}-\mu_{j})
\]
Here, \(h_{n,i}\) is the activation of the \(i\)-th hidden unit for the \(n\)-th sample, \(N\) is the number of samples, and \(\mu_{i}=\frac{1}{N}\sum_{n}h_{n,i}\) is the mean activation of the \(i\)-th hidden unit over those samples. The final DeCov loss is defined as:
\[
L_{\text{DeCov}}=\frac{1}{2}(\|C\|_{F}^{2}-\|\text{diag}(C)\|_{2}^{2})
\]
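Here, \(\|\cdot\|_{F}\) is the Frobenius norm, and \(\text{diag}(C)\) extracts the diagonal elements of the matrix \(C\). Subtracting the diagonal term leaves only the squared off-diagonal entries, so the loss penalizes covariance between different hidden units but not the variance of any single unit.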
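The two formulas above can be sketched directly in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `decov_loss` and the `(N, D)` layout (one row per sample, one column per hidden unit) are choices made for the example.

```python
import numpy as np


def decov_loss(h):
    """DeCov loss for a batch of hidden activations.

    h: array of shape (N, D), N samples by D hidden units
       (this layout is an assumption of the sketch).
    """
    h = np.asarray(h, dtype=np.float64)
    n = h.shape[0]
    # Subtract the per-unit mean mu_i computed over the batch.
    centered = h - h.mean(axis=0, keepdims=True)
    # C[i, j] = (1/N) * sum_n (h[n, i] - mu_i) * (h[n, j] - mu_j)
    cov = centered.T @ centered / n
    frob_sq = np.sum(cov ** 2)           # ||C||_F^2
    diag_sq = np.sum(np.diag(cov) ** 2)  # ||diag(C)||_2^2
    return 0.5 * (frob_sq - diag_sq)


# Two perfectly correlated units: the off-diagonal covariance is penalized.
redundant = np.array([[1.0, 1.0], [-1.0, -1.0]])
# Two uncorrelated units: C is the identity, so nothing is penalized.
diverse = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])

print(decov_loss(redundant))  # 1.0
print(decov_loss(diverse))    # 0.0
```

As a regularizer, a term like this would be added to the task loss with a weighting coefficient. That weight is a hyperparameter, and this summary does not give a value for it.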
### Experimental Results
The authors evaluate DeCov in a series of experiments, including:
- **MNIST Bimodal Experiment**: A synthetic task in which the model predicts two adjacent handwritten digits simultaneously. The results show that DeCov significantly reduces overfitting and improves generalization performance.
- **Image Classification Experiment**: Image classification on datasets such as CIFAR10, CIFAR100, and ImageNet. The experiments show that DeCov not only improves test-set accuracy but also significantly narrows the gap between training and validation performance.
### Conclusion
This paper proposes a new regularizer, the DeCov loss, which prevents overfitting by reducing the correlation between hidden-layer activations and thereby improves the model's ability to generalize. Experimental results show that DeCov effectively reduces overfitting across multiple datasets and network architectures, and that it usually maintains or improves model performance.