Analyzing The Language of Visual Tokens

David M. Chan, Rodolfo Corona, Joonyong Park, Cheol Jun Cho, Yutong Bai, Trevor Darrell
2024-11-08
Abstract: With the introduction of transformer-based models for vision and language tasks, such as LLaVA and Chameleon, there has been renewed interest in the discrete tokenized representation of images. These models often treat image patches as discrete tokens, analogous to words in natural language, learning joint alignments between visual and human languages. However, little is known about the statistical behavior of these visual languages - whether they follow similar frequency distributions, grammatical structures, or topologies as natural languages. In this paper, we take a natural-language-centric approach to analyzing discrete visual languages and uncover striking similarities and fundamental differences. We demonstrate that, although visual languages adhere to Zipfian distributions, higher token innovation drives greater entropy and lower compression, with tokens predominantly representing object parts, indicating intermediate granularity. We also show that visual languages lack cohesive grammatical structures, leading to higher perplexity and weaker hierarchical organization compared to natural languages. Finally, we demonstrate that, while vision models align more closely with natural languages than other models, this alignment remains significantly weaker than the cohesion found within natural languages. Through these experiments, we demonstrate how understanding the statistical properties of discrete visual languages can inform the design of more effective computer vision models.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper examines how the statistical behavior of visual languages differs from that of natural languages. Specifically, the researchers focus on the following questions:

1. **Do visual languages follow statistical laws similar to natural languages?**
   - Do they follow similar frequency distributions (e.g., Zipf's law)?
   - Do they have similar grammatical structures?
   - Do they have similar semantic dependencies?
2. **Is the internal structure of visual languages similar to that of natural languages?**
   - Do the "vocabulary" items of visual languages (i.e., image patches) operate like words in natural languages?
   - What are the entropy and compression rates of visual languages?
   - What is the token innovation rate of visual languages?
3. **What impact do these statistical behaviors have on the design of computer vision models?**
   - How can these statistical characteristics be used to design more effective computer vision models?
   - What insights do the high entropy and low compression rates of visual languages provide for model design?

Through these questions, the researchers hope to reveal the unique properties of visual languages and provide a theoretical foundation and technical guidance for future research and applications.
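
The statistics named in these questions (Zipfian rank-frequency behavior, entropy, compression, and token innovation) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names are hypothetical, the innovation rate is approximated by the fraction of positions that introduce a previously unseen token, and the compression ratio uses `zlib` on 2-byte token ids as a stand-in for whatever compressor the authors used. It assumes a non-empty sequence of integer token ids in the range 0 to 65535.

```python
import math
import zlib
from collections import Counter


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 is the classic Zipfian signature. Requires at least
    two distinct tokens, otherwise the fit is undefined.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct tokens")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def unigram_entropy(tokens):
    """Shannon entropy (bits per token) of the empirical unigram distribution."""
    total = len(tokens)
    return -sum(
        (c / total) * math.log2(c / total) for c in Counter(tokens).values()
    )


def innovation_rate(tokens):
    """Assumed proxy: fraction of positions where a never-before-seen token appears."""
    return len(set(tokens)) / len(tokens)


def compression_ratio(tokens):
    """Compressed size over raw size, encoding each token id as 2 big-endian bytes.

    Lower values mean the sequence is more compressible.
    """
    raw = b"".join(t.to_bytes(2, "big") for t in tokens)
    return len(zlib.compress(raw)) / len(raw)


if __name__ == "__main__":
    sample = [0, 0, 0, 0, 1, 1, 2, 3]  # toy token ids, not real data
    print("Zipf slope:", zipf_slope(sample))
    print("Entropy (bits):", unigram_entropy(sample))
    print("Innovation rate:", innovation_rate(sample))
    print("Compression ratio:", compression_ratio(sample))
```

Under these definitions, a token stream with a high innovation rate keeps introducing new symbols, which raises unigram entropy and makes the stream harder to compress. This is the relationship the abstract describes for visual tokens.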