A Spitting Image: Modular Superpixel Tokenization in Vision Transformers

Marius Aasan, Odd Kolbjørnsen, Anne Schistad Solberg, Adín Ramirez Rivera
2024-08-15
Abstract: Vision Transformer (ViT) architectures traditionally employ a grid-based approach to tokenization independent of the semantic content of an image. We propose a modular superpixel tokenization strategy which decouples tokenization and feature extraction; a shift from contemporary approaches where these are treated as an undifferentiated whole. Using on-line content-aware tokenization and scale- and shape-invariant positional embeddings, we perform experiments and ablations that contrast our approach with patch-based tokenization and randomized partitions as baselines. We show that our method significantly improves the faithfulness of attributions, gives pixel-level granularity on zero-shot unsupervised dense prediction tasks, while maintaining predictive performance in classification tasks. Our approach provides a modular tokenization framework commensurable with standard architectures, extending the space of ViTs to a larger class of semantically-rich models.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses limitations of the grid-based tokenization used in traditional Vision Transformer (ViT) architectures. Specifically:

1. **Fixed-size tokenization**: Traditional ViTs split images into fixed square patches, which ties the model architecture to the tokenization scale and does not exploit redundancy in the image. This increases computational and memory cost, especially for high-resolution images.
2. **Neglect of semantic content**: Grid tokenization partitions the image independently of its semantic content, as if that content were uniformly distributed. This reduces the effective spatial resolution and leads to poor performance in dense prediction tasks.
3. **Lack of interpretability**: Attention maps over square patches lose resolution when used to interpret model decisions. They fail to capture details of the original image, and pixel-level dense prediction requires additional decoders.

To address these issues, the paper proposes a modular superpixel tokenization strategy that adapts to image content and decouples tokenization from feature extraction. According to the abstract, this strategy significantly improves the faithfulness of attributions and gives pixel-level granularity on zero-shot unsupervised dense prediction tasks, while maintaining classification performance.
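
The contrast between fixed grid patches and content-aware regions can be illustrated with a small NumPy sketch. It is not the paper's method: the function names (`grid_partition`, `intensity_partition`, `pool_tokens`) are hypothetical, and the intensity quantization is only a toy stand-in for a real superpixel algorithm. It assumes a single-channel image with values in [0, 1] and uses the mean intensity of each region as a one-dimensional "token".

```python
import numpy as np


def grid_partition(height, width, patch):
    """Label map assigning each pixel to a fixed square patch (ViT-style grid)."""
    rows = np.arange(height) // patch
    cols = np.arange(width) // patch
    n_cols = -(-width // patch)  # ceiling division: patches per row
    return rows[:, None] * n_cols + cols[None, :]


def intensity_partition(image, levels):
    """Toy content-aware partition: group pixels by quantized intensity.

    Assumption: a stand-in for a superpixel algorithm. Unlike real
    superpixels, the resulting regions are not required to be connected.
    """
    bins = np.minimum((image * levels).astype(int), levels - 1)
    # Relabel the occupied bins to consecutive ids 0..K-1.
    _, labels = np.unique(bins.ravel(), return_inverse=True)
    return labels.reshape(image.shape)


def pool_tokens(image, labels):
    """Mean intensity per region: one scalar 'token' per partition cell."""
    flat = labels.ravel()
    sums = np.bincount(flat, weights=image.ravel())
    counts = np.bincount(flat)
    return sums / counts


# An 8x8 image with a vertical edge at column 3, which does not align
# with a patch size of 4.
image = np.zeros((8, 8))
image[:, 3:] = 1.0

grid_labels = grid_partition(8, 8, patch=4)
content_labels = intensity_partition(image, levels=2)

grid_tokens = pool_tokens(image, grid_labels)
content_tokens = pool_tokens(image, content_labels)

print("grid tokens:   ", grid_tokens)
print("content tokens:", content_tokens)
```

In this toy case the grid always yields four tokens, and the two left-hand patches straddle the edge, so their pooled value mixes dark and bright pixels. The content-aware partition yields two tokens whose boundaries follow the edge exactly. This is the intuition behind the pixel-level granularity the summary describes.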