Random Teachers are Good Teachers

Felix Sarnthein,Gregor Bachmann,Sotiris Anagnostidis,Thomas Hofmann

2023-06-19

Abstract: In this work, we investigate the implicit regularization induced by teacher-student learning dynamics in self-distillation. To isolate its effect, we describe a simple experiment where we consider teachers at random initialization instead of trained teachers. Surprisingly, when distilling a student into such a random teacher, we observe that the resulting model and its representations already possess very interesting characteristics: (1) we observe a strong improvement of the distilled student over its teacher in terms of probing accuracy. (2) The learned representations are data-dependent and transferable between different tasks but deteriorate strongly if trained on random inputs. (3) The student checkpoint contains sparse subnetworks, so-called lottery tickets, and lies on the border of linear basins in the supervised loss landscape. These observations have interesting consequences for several important areas in machine learning: (1) Self-distillation can work solely based on the implicit regularization present in the gradient dynamics without relying on any dark knowledge, (2) self-supervised learning can learn features even in the absence of data augmentation and (3) training dynamics during the early phase of supervised training do not necessarily require label information. Finally, we shed light on an intriguing local property of the loss landscape: the process of feature learning is strongly amplified if the student is initialized closely to the teacher. These results raise interesting questions about the nature of the landscape that have remained unexplored so far. Code is available at https://github.com/safelix/dinopl.

Machine Learning

What problem does this paper attempt to address?

This paper aims to understand the implicit regularization effect in teacher-student learning dynamics, particularly during self-distillation. To isolate this effect, the authors design a simple experiment in which the teacher in self-distillation is a randomly initialized model rather than a trained one. With this setup, they explore the following questions:

1. **The role of implicit regularization**: In the absence of "dark knowledge" (i.e., the extra information contained in a trained teacher model), can self-distillation still produce an effective student model?
2. **Data dependence and task transfer ability**: Are the representations that the student learns from a random teacher data-dependent, and can they be transferred between different tasks?
3. **The influence of initialization**: How does the initial position of the student relative to the teacher affect the final learning outcome? In particular, does initializing the student closer to the teacher significantly improve what it learns?
4. **Local characteristics of the loss landscape**: Which local property of the loss landscape causes feature learning to be strongly amplified when the student is initialized close to the teacher?

Through these studies, the authors aim to reveal underlying mechanisms in several areas of machine learning, such as self-supervised learning, feature learning, and optimization dynamics. The answers help explain why self-distillation is effective and may also provide new insights into related fields.
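The experimental setup described above can be illustrated with a deliberately tiny sketch: a frozen, randomly initialized teacher, a student initialized at an adjustable distance from it, and gradient descent on the cross-entropy between their output distributions. Everything here is an assumption made for illustration and is not taken from the paper or its code: the linear softmax model, the Gaussian toy data, the helper names (`init_student`, `distill`, `distill_loss`), and the interpolation parameter `alpha`. A linear model like this cannot reproduce the paper's findings about learned representations; it only shows the mechanics of distilling into a random teacher.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(weights, x):
    """Linear map followed by softmax; weights is a list of rows."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])


def random_weights(rng, n_out, n_in):
    """Gaussian initialization (hypothetical choice for this sketch)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


def init_student(teacher, fresh, alpha):
    """Interpolate between the teacher (alpha=0) and an independent init (alpha=1).

    This interpolation is an assumption used to vary the student's
    distance from the teacher; it is not the paper's parameterization.
    """
    return [[(1 - alpha) * t + alpha * f for t, f in zip(t_row, f_row)]
            for t_row, f_row in zip(teacher, fresh)]


def distill_loss(teacher, student, data):
    """Mean cross-entropy between teacher and student output distributions."""
    total = 0.0
    for x in data:
        p_t, p_s = forward(teacher, x), forward(student, x)
        total -= sum(pt * math.log(ps) for pt, ps in zip(p_t, p_s))
    return total / len(data)


def distill(teacher, student, data, lr=0.1, steps=200):
    """Full-batch gradient descent on the student; the teacher stays frozen."""
    for _ in range(steps):
        grad = [[0.0] * len(row) for row in student]
        for x in data:
            p_t, p_s = forward(teacher, x), forward(student, x)
            for k in range(len(student)):
                for j in range(len(x)):
                    # d(loss)/d(W[k][j]) for softmax cross-entropy
                    grad[k][j] += (p_s[k] - p_t[k]) * x[j] / len(data)
        student = [[w - lr * g for w, g in zip(w_row, g_row)]
                   for w_row, g_row in zip(student, grad)]
    return student


rng = random.Random(0)
data = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(64)]  # unlabeled toy inputs
teacher = random_weights(rng, 4, 8)  # never trained
student_init = init_student(teacher, random_weights(rng, 4, 8), alpha=0.5)
student = distill(teacher, student_init, data)
loss_before = distill_loss(teacher, student_init, data)
loss_after = distill_loss(teacher, student, data)
```

Note that no labels appear anywhere in the loop: the only training signal is the output of the untrained teacher. Sweeping `alpha` from 0 to 1 is one simple way to vary how close the student starts to the teacher, which is the quantity questions 3 and 4 are about.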

Self-Distillation as Instance-Specific Label Smoothing

Revisiting Self-Distillation

On student-teacher deviations in distillation: does it pay to disobey?

Self-Distillation for Randomized Neural Networks

Understanding the Gains from Repeated Self-Distillation

Restructuring the Teacher and Student in Self-Distillation

Knowledge Distillation Meets Self-Supervision

Dual teachers for self-knowledge distillation

Logit Distillation via Student Diversity

Boosting Knowledge Distillation Via Intra-class Logit Distribution Smoothing

Subclass Distillation

Self Regulated Learning Mechanism for Data Efficient Knowledge Distillation

Self-Distillation from the Last Mini-Batch for Consistency Regularization

Progressive distillation induces an implicit curriculum

How a student becomes a teacher: learning and forgetting through spectral methods

Learn From the Past: Experience Ensemble Knowledge Distillation

Undistillable: Making A Nasty Teacher That CANNOT teach students

Linear Projections of Teacher Embeddings for Few-Class Distillation

DCD: Discriminative and Consistent Representation Distillation

Revisiting Knowledge Distillation Via Label Smoothing Regularization