Random Teachers are Good Teachers

Felix Sarnthein, Gregor Bachmann, Sotiris Anagnostidis, Thomas Hofmann
2023-06-19
Abstract: In this work, we investigate the implicit regularization induced by teacher-student learning dynamics in self-distillation. To isolate its effect, we describe a simple experiment where we consider teachers at random initialization instead of trained teachers. Surprisingly, when distilling a student into such a random teacher, we observe that the resulting model and its representations already possess very interesting characteristics: (1) We observe a strong improvement of the distilled student over its teacher in terms of probing accuracy. (2) The learned representations are data-dependent and transferable between different tasks but deteriorate strongly if trained on random inputs. (3) The student checkpoint contains sparse subnetworks, so-called lottery tickets, and lies on the border of linear basins in the supervised loss landscape. These observations have interesting consequences for several important areas in machine learning: (1) Self-distillation can work solely based on the implicit regularization present in the gradient dynamics without relying on any dark knowledge, (2) self-supervised learning can learn features even in the absence of data augmentation and (3) training dynamics during the early phase of supervised training do not necessarily require label information. Finally, we shed light on an intriguing local property of the loss landscape: the process of feature learning is strongly amplified if the student is initialized closely to the teacher. These results raise interesting questions about the nature of the landscape that have remained unexplored so far. Code is available at https://github.com/safelix/dinopl.
Machine Learning
What problem does this paper attempt to address?
The paper aims to understand the implicit regularization that arises from teacher-student learning dynamics, in particular during self-distillation. To isolate this effect, the authors design a simple experiment in which the teacher is a randomly initialized network rather than a trained one. With this setup they explore the following questions:

1. **The role of implicit regularization**: Without "dark knowledge" (the extra information carried by a trained teacher's outputs), can self-distillation still produce an effective student model?
2. **Data dependence and task transfer**: Are the representations that the student learns from a random teacher data-dependent, and do they transfer between different tasks?
3. **The influence of initialization**: How does the student's initial position relative to the teacher affect what it learns? In particular, does initializing the student close to the teacher significantly improve the learned features?
4. **Local properties of the loss landscape**: Which local properties of the loss landscape around the teacher account for the strong amplification of feature learning when the student starts nearby?

Through these questions, the authors hope to clarify the mechanisms at work in several areas of machine learning, including self-supervised learning, feature learning, and optimization dynamics. The answers help explain why self-distillation works and may offer new insights for related fields.
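The experimental setup described above can be illustrated with a small, self-contained sketch. It is not the authors' implementation (their code is in the linked repository). It is a toy written under stated assumptions: the "networks" are single linear layers followed by a softmax, the loss is the cross-entropy between teacher and student outputs, and the function name `distill_from_random_teacher` and the interpolation parameter `alpha` (which controls how close the student starts to the teacher) are hypothetical names introduced here for illustration.

```python
import math
import random


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(weights, x):
    """Toy linear 'network': logits = weights @ x, followed by softmax."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])


def cross_entropy(p, q):
    """Cross-entropy H(p, q) between teacher output p and student output q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


def distill_from_random_teacher(inputs, d_out=4, alpha=1.0, lr=0.5,
                                steps=200, seed=0):
    """Train a student to match a frozen, randomly initialized teacher.

    alpha is an illustrative assumption: alpha=0 starts the student exactly
    at the teacher, alpha=1 starts it at an independent random draw.
    """
    rng = random.Random(seed)
    d_in = len(inputs[0])
    # Teacher: random weights that are never trained.
    teacher = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]
    # Student: interpolate between the teacher and a fresh random draw.
    fresh = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]
    student = [[(1.0 - alpha) * t + alpha * f for t, f in zip(t_row, f_row)]
               for t_row, f_row in zip(teacher, fresh)]

    n = len(inputs)
    losses = []
    for _ in range(steps):
        grad = [[0.0] * d_in for _ in range(d_out)]
        loss = 0.0
        for x in inputs:
            p = forward(teacher, x)  # fixed target; no labels are involved
            q = forward(student, x)
            loss += cross_entropy(p, q)
            # Gradient of H(p, softmax(z)) w.r.t. logits z is (q - p);
            # the chain rule through z = W x gives the outer product with x.
            for k in range(d_out):
                for j in range(d_in):
                    grad[k][j] += (q[k] - p[k]) * x[j]
        losses.append(loss / n)  # loss measured before this step's update
        # Plain full-batch gradient descent step on the student only.
        student = [[w - lr * g / n for w, g in zip(w_row, g_row)]
                   for w_row, g_row in zip(student, grad)]
    return teacher, student, losses
```

Calling `distill_from_random_teacher(xs, alpha=1.0)` on a list of input vectors `xs` returns the frozen teacher, the trained student, and the per-step loss. With `alpha=0.0` the student starts at the teacher, which already minimizes the loss, so nothing changes; a small non-zero `alpha` is the regime the questions on initialization refer to. A linear model has no hidden representation, so this sketch only shows the label-free objective and the initialization scheme, not the feature-learning effects the paper reports.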