The Encoding Method of Position Embeddings in Vision Transformer

Kai Jiang, Peng, Youzao Lian, Weisheng Xu
DOI: https://doi.org/10.1016/j.jvcir.2022.103664
IF: 2.887
2022-01-01
Journal of Visual Communication and Image Representation
Abstract: Unlike Convolutional Neural Networks (CNNs), Vision Transformers (ViTs) cannot capture the ordering of input tokens on their own and therefore require position embeddings. As a learnable vector of fixed dimension, the position embedding improves accuracy but limits how well a model transfers between different input sizes. This paper therefore conducts an empirical study of position embeddings in pre-trained models, focusing on two questions: (1) What do the position embeddings learn during training? (2) How do the position embeddings affect the self-attention modules? The paper analyzes the patterns of position embeddings in pre-trained models and finds that a linear combination of Gabor filters and edge markers fits the learned position embeddings well. The Gabor filters and edge markers occupy some channels to append position information, and the edge markers flow into the values of the self-attention modules. The experimental results can guide the choice of suitable position embeddings in future work.
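To make the fitting claim concrete, the following is a minimal sketch of how one channel of a position embedding, reshaped to the patch grid, could be approximated by a linear combination of Gabor filters and edge markers using ordinary least squares. It is not the authors' procedure: the function names (`gabor`, `edge_markers`, `build_basis`, `fit_channel`), the filter bank parameters (orientations, wavelengths, envelope width), the reading of "edge markers" as border indicator maps, and the 14 x 14 grid size are all assumptions made for illustration.

```python
import numpy as np


def gabor(h, w, theta, wavelength, sigma, phase=0.0):
    """Real-valued 2D Gabor filter sampled on an h x w grid centred on the grid."""
    ys, xs = np.meshgrid(
        np.arange(h) - (h - 1) / 2,
        np.arange(w) - (w - 1) / 2,
        indexing="ij",
    )
    # Coordinate along the filter's orientation.
    xr = xs * np.cos(theta) + ys * np.sin(theta)
    envelope = np.exp(-(xs**2 + ys**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * xr / wavelength + phase)


def edge_markers(h, w):
    """Four indicator maps, one per border (assumed reading of 'edge markers')."""
    maps = np.zeros((4, h, w))
    maps[0, 0, :] = 1   # top row
    maps[1, -1, :] = 1  # bottom row
    maps[2, :, 0] = 1   # left column
    maps[3, :, -1] = 1  # right column
    return maps


def build_basis(h, w):
    """Stack a small, hypothetical Gabor bank and the edge markers as rows."""
    filters = [
        gabor(h, w, theta, wavelength, sigma=h / 2, phase=phase)
        for theta in np.linspace(0, np.pi, 4, endpoint=False)
        for wavelength in (h / 2, h, 2 * h)
        for phase in (0.0, np.pi / 2)
    ]
    basis = np.concatenate([np.stack(filters), edge_markers(h, w)])
    return basis.reshape(len(basis), -1)  # shape: (num_basis, h * w)


def fit_channel(channel, basis):
    """Least-squares fit of one embedding channel; returns coefficients and R^2."""
    target = channel.ravel()
    coeffs, *_ = np.linalg.lstsq(basis.T, target, rcond=None)
    recon = basis.T @ coeffs
    ss_res = np.sum((target - recon) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return coeffs, 1 - ss_res / ss_tot


# Usage on a synthetic channel built from two basis elements (not real ViT weights).
basis = build_basis(14, 14)
channel = (0.7 * basis[3] + 1.5 * basis[-1]).reshape(14, 14)
coeffs, r2 = fit_channel(channel, basis)
```

With a real model, `channel` would instead be one channel of the learned position embedding (excluding any class token) reshaped to the patch grid. An R^2 close to 1 across many channels would be consistent with the abstract's claim, although the quality of the fit depends heavily on the chosen filter bank.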