A Spitting Image: Modular Superpixel Tokenization in Vision Transformers

Marius Aasan, Odd Kolbjørnsen, Anne Schistad Solberg, Adín Ramirez Rivera

2024-08-15

Abstract: Vision Transformer (ViT) architectures traditionally employ a grid-based approach to tokenization independent of the semantic content of an image. We propose a modular superpixel tokenization strategy which decouples tokenization and feature extraction; a shift from contemporary approaches where these are treated as an undifferentiated whole. Using on-line content-aware tokenization and scale- and shape-invariant positional embeddings, we perform experiments and ablations that contrast our approach with patch-based tokenization and randomized partitions as baselines. We show that our method significantly improves the faithfulness of attributions, gives pixel-level granularity on zero-shot unsupervised dense prediction tasks, while maintaining predictive performance in classification tasks. Our approach provides a modular tokenization framework commensurable with standard architectures, extending the space of ViTs to a larger class of semantically-rich models.

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

The paper addresses limitations of the grid-based tokenization used in traditional Vision Transformer (ViT) architectures. Specifically:

1. **Fixed-size tokenization**: Traditional ViTs split images into fixed square patches, which rigidly binds the model architecture to the tokenization scale and does not exploit redundancy in the original image. This increases computational complexity and memory consumption, especially for high-resolution images.
2. **Neglect of semantic content**: Grid tokenization implicitly assumes that semantic content is distributed uniformly across the image. In practice this reduces the effective spatial resolution and leads to poor performance on dense prediction tasks.
3. **Lack of interpretability**: Attention maps built on square patches lose resolution when used to interpret model decisions. They fail to capture details of the original image, and pixel-level dense prediction requires an additional decoder.

To address these issues, the paper proposes a modular superpixel tokenization strategy that adapts the tokens to the content of each image and decouples tokenization from feature extraction. According to the authors, this strategy significantly improves the faithfulness of attributions and gives pixel-level granularity on zero-shot unsupervised dense prediction tasks, while maintaining classification performance.
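The contrast between grid tokens and content-shaped tokens can be illustrated with a small NumPy sketch. This is not the paper's implementation: the function names `grid_tokens` and `region_tokens` are hypothetical, the region label map is written by hand rather than produced by a superpixel algorithm, and mean pooling of raw pixel values stands in for whatever feature extraction a real model would use.

```python
import numpy as np

def grid_tokens(image, patch):
    """Split an (H, W, C) image into non-overlapping square patches.

    Returns an array of shape (num_patches, patch * patch * C).
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def region_tokens(image, labels):
    """Mean-pool pixel values over each region of an integer label map.

    `labels` has shape (H, W) with region ids 0..N-1 (assumed input).
    Returns an array of shape (N, C), one token per region.
    """
    c = image.shape[-1]
    flat = image.reshape(-1, c)
    ids = labels.reshape(-1)
    n = int(ids.max()) + 1
    sums = np.zeros((n, c))
    np.add.at(sums, ids, flat)  # accumulate pixel values per region
    counts = np.bincount(ids, minlength=n)[:, None]
    return sums / counts

# Toy 4x4 single-channel image and a hand-written irregular partition.
image = np.arange(16, dtype=float).reshape(4, 4, 1)
labels = np.array([[0, 0, 1, 1],
                   [0, 1, 1, 1],
                   [2, 2, 1, 1],
                   [2, 2, 2, 1]])

patches = grid_tokens(image, patch=2)   # 4 tokens, fixed 2x2 shape
regions = region_tokens(image, labels)  # 3 tokens, irregular shapes

# A per-token score maps back to pixels by indexing with the label map,
# so the resulting map follows region boundaries instead of a coarse grid.
scores = regions[:, 0]
pixel_map = scores[labels]              # shape (4, 4)
```

In the grid case the number and shape of tokens depend only on the image size and patch size. In the region case they depend on the partition, which is why a per-token quantity such as an attribution score can be projected back to the image at pixel resolution.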

Superpixel Tokenization for Vision Transformers: Preserving Semantic Integrity in Visual Tokens

Vision Transformers with Natural Language Semantics

MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers

Making Vision Transformers Efficient from A Token Sparsification View

Vision Transformers with Mixed-Resolution Tokenization

Vision Transformer with Super Token Sampling

Sub-token ViT Embedding via Stochastic Resonance Transformers

Transformer with token attention and attribute prediction for image captioning

Content-aware Token Sharing for Efficient Semantic Segmentation with Vision Transformers

Dynamic Token-Pass Transformers for Semantic Segmentation

Not All Images Are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length

Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations

Token Turing Machines are Efficient Vision Models

DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Super Vision Transformer

Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition

ViTAR: Vision Transformer with Any Resolution