Abstract: We explore the effect of introducing prior information into the intermediate level of neural networks for a learning task on which all the state-of-the-art machine learning algorithms tested failed to learn. We motivate our work from the hypothesis that humans learn such intermediate concepts from other individuals via a form of supervision or guidance using a curriculum. The experiments we have conducted provide positive evidence in favor of this hypothesis. In our experiments, a two-tiered MLP architecture is trained on a dataset of 64x64 binary input images, each containing three sprites. The final task is to decide whether all the sprites are the same or one of them is different. Sprites are pentomino tetris shapes, and they are placed in an image at different locations using scaling and rotation transformations. The first part of the two-tiered MLP is pre-trained with intermediate-level targets being the presence of sprites at each location, while the second part takes the output of the first part as input and predicts the final task's target binary event. The two-tiered MLP architecture, with a few tens of thousands of examples, was able to learn the task perfectly, whereas all other algorithms (including unsupervised pre-training, but also traditional algorithms like SVMs, decision trees and boosting) perform no better than chance. We hypothesize that the optimization difficulty involved when the intermediate pre-training is not performed is due to the *composition* of two highly non-linear tasks. Our findings are also consistent with hypotheses on cultural learning inspired by the observations of optimization problems with deep learning, presumably because of effective local minima.

What problem does this paper attempt to address?

The problem this paper attempts to solve is: **how to help deep supervised neural networks overcome optimization obstacles in certain learning tasks by introducing prior information (i.e., intermediate concepts)**. Specifically, the paper explores why black-box algorithms fail to learn some complex machine-learning tasks directly, and proposes assisting the learning process with guidance in the form of intermediate concepts.

### Problem Background

The authors point out that on some specific learning tasks, existing state-of-the-art machine-learning algorithms (including deep neural networks) perform poorly and may do no better than random guessing. These tasks are characterized by the composition of multiple highly nonlinear subtasks, which makes training prone to getting trapped in local minima or running into other optimization obstacles.

### Specific Task

To test this hypothesis, the authors designed an experimental task: in a dataset of 64×64 binary images, each image contains three Pentomino-shaped sprites. The task is to determine whether these three sprites all have the same shape. This task is relatively simple for humans, but very difficult for machine-learning algorithms.

### Solution

The solution proposed by the authors is to guide the learning process by introducing intermediate concepts. Specifically:

1. **Hierarchical Architecture**: Use a two-tiered multi-layer perceptron (MLP) architecture, where the first tier identifies the sprite category in each 8×8 image patch, and the second tier makes the final binary classification decision based on the output of the first tier.
2. **Intermediate Objectives**: During training, intermediate targets (i.e., intermediate concepts) describing the sprite category in each image patch are provided to the first tier.

This guidance enables the neural network to learn and optimize more effectively.

### Experimental Results

The experimental results show that when intermediate concepts are provided as guidance, the neural network learns the task successfully, while the same network without such guidance fails to learn. This indicates that **introducing prior information and intermediate concepts can significantly improve learning and help the model overcome optimization obstacles**.

### Conclusion

The main contribution of this paper lies in revealing the nature of the optimization difficulties in some complex tasks and experimentally verifying the effectiveness of introducing intermediate concepts. This suggests directions for future research, especially on how to improve learning through appropriate guidance when dealing with complex and abstract tasks.

### Key Formulas

- **Output Calculation Formula of P1NN** (the patch-level network of the first tier):

\[ f_{\theta}(p_i) = g_2(V \cdot g_1(U \cdot p_i + b) + c) \]

where:

- \( p_i \in \mathbb{R}^d \) is the input patch extracted from image position \( i \).
- \( U \in \mathbb{R}^{d_h \times d} \) is the weight matrix of the first layer.
- \( b \in \mathbb{R}^{d_h} \) is the bias vector of the first layer.
- \( g_1(\cdot) \) and \( g_2(\cdot) \) are the activation functions of the first layer and the second layer respectively.
- \( V \in \mathbb{R}^{d_o \times d_h} \) is the weight matrix of the second layer.
- \( c \in \mathbb{R}^{d_o} \) is the bias vector of the second layer.

- **Calculation Formula of the Normalization Layer**:

\[ z\left(h_o^{(i)}(x_j)\right) = \frac{h_o^{(i)}(x_j) - \mu_{h_o^{(i)}}}{\max\left(\sigma_{h_o^{(i)}}, \epsilon\right)} \]

where:

- \( \mu_{h_o^{(i)}} = \frac{1}{N}\sum_{x_j \in X} h_o^{(i)}(x_j) \)
- \( \sigma_{h_o^{(i)}} = \sqrt{\frac{1}{N}\sum_{j = 1}^{N}\left(h_o^{(i)}(x_j) - \mu_{h_o^{(i)}}\right)^2 + \epsilon} \)
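The two formulas above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the choice of tanh for \( g_1 \) and softmax for \( g_2 \), the helper names (`matvec`, `p1nn_forward`, `standardize`), the value of `eps`, and the toy dimensions are all hypothetical and were picked only to make the example runnable.

```python
import math

def matvec(W, x, bias):
    """Return W x + bias, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b_k
            for row, b_k in zip(W, bias)]

def tanh_vec(v):
    # Assumed choice for g_1; the summary does not name the activation.
    return [math.tanh(a) for a in v]

def softmax(v):
    # Assumed choice for g_2; subtracting the max keeps exp() stable.
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def p1nn_forward(p, U, b, V, c, g1=tanh_vec, g2=softmax):
    """Compute f_theta(p) = g2(V g1(U p + b) + c) for one flattened patch p."""
    hidden = g1(matvec(U, p, b))       # shape d_h
    return g2(matvec(V, hidden, c))    # shape d_o

def standardize(values, eps=1e-6):
    """Normalize one output unit's activations over N examples.

    z = (h - mu) / max(sigma, eps), where sigma = sqrt(variance + eps).
    """
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n + eps)
    return [(v - mu) / max(sigma, eps) for v in values]

# Toy dimensions (hypothetical): d = 4 inputs, d_h = 3 hidden units, d_o = 2 outputs.
U = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.5, -0.1, 0.2], [-0.3, 0.1, 0.0, 0.4]]
b = [0.0, 0.1, -0.1]
V = [[0.2, -0.4, 0.1], [-0.1, 0.3, 0.5]]
c = [0.0, 0.0]

probs = p1nn_forward([1.0, 0.0, 1.0, 0.0], U, b, V, c)
z = standardize([0.2, 0.4, 0.9])
```

Here `probs` holds one score per output class for a single patch, and `z` holds the standardized activations of one output unit across three examples. Because \( \epsilon \) is added under the square root, `standardize` stays well defined even when every activation is identical.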

Knowledge Matters: Importance of Prior Information for Optimization

Informed Pre-Training on Prior Knowledge

Drop Redundant, Shrink Irrelevant: Selective Knowledge Injection for Language Pretraining

Worth of prior knowledge for enhancing deep learning

Infusing Expert Knowledge Into a Deep Neural Network Using Attention Mechanism for Personalized Learning Environments

Scaling MLPs: A Tale of Inductive Bias

Worth of knowledge in deep learning

Knowledge-Adaptation Priors

Encoding priors in the brain: a reinforcement learning model for mouse decision making

A Flexible Framework for Designing Trainable Priors with Adaptive Smoothing and Game Encoding

Leveraging Prior Concept Learning Improves Generalization From Few Examples in Computational Models of Human Object Recognition

The Perils of Learning Before Optimizing

Evolving Culture vs Local Minima

Information Bottleneck Theory Based Exploration of Cascade Learning

Memory Aware Synapses: Learning what (not) to forget

Learning by Turning: Neural Architecture Aware Optimisation

Early learning of the optimal constant solution in neural networks and humans

An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks

Interleaving Learning, with Application to Neural Architecture Search

Understanding and Improving Optimization in Predictive Coding Networks

Training Multi-Layer Perceptron with Enhanced Brain Storm Optimization Metaheuristics