Abstract: In this work, we present a comprehensive analysis of causal image modeling and introduce the Adventurer series models where we treat images as sequences of patch tokens and employ uni-directional language models to learn visual representations. This modeling paradigm allows us to process images in a recurrent formulation with linear complexity relative to the sequence length, which can effectively address the memory and computation explosion issues posed by high-resolution and fine-grained images. In detail, we introduce two simple designs that seamlessly integrate image inputs into the causal inference framework: a global pooling token placed at the beginning of the sequence and a flipping operation between every two layers. Extensive empirical studies demonstrate the significant efficiency and effectiveness of this causal image modeling paradigm. For example, our base-sized Adventurer model attains a competitive test accuracy of 84.0% on the standard ImageNet-1k benchmark with 216 images/s training throughput, which is 5.3 times more efficient than vision transformers to achieve the same result.
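To make the scaling argument in the abstract concrete, the sketch below counts patch tokens at several resolutions and compares a linear cost proxy (the token count) with a quadratic one (the number of token pairs a full self-attention layer would score). The function name `num_patch_tokens` and the patch size of 16 are illustrative assumptions, not values taken from the paper, and the proxies ignore constants and the hidden dimension.

```python
def num_patch_tokens(height: int, width: int, patch_size: int = 16) -> int:
    """Count non-overlapping patches. patch_size=16 is an assumed value."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


for side in (224, 448, 896):
    n = num_patch_tokens(side, side)
    # Linear proxy: n steps of a recurrent, uni-directional mixer.
    # Quadratic proxy: n * n pairwise scores in full self-attention.
    print(f"{side}x{side}: tokens={n}, linear={n}, quadratic={n * n}")
```

Under these assumptions, doubling the image side quadruples the token count, so the linear proxy grows 4x while the quadratic proxy grows 16x.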

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to achieve efficient visual understanding through Causal Image Modeling (CIM). Specifically, the researchers propose an image processing method that treats an image as a sequence of patch tokens and uses a unidirectional language model to learn visual representations. The method targets the memory and computation explosion that arises when processing high-resolution and fine-grained images.

### Main problems

1. **Efficiently processing high-resolution images**: Traditional Vision Transformers (ViTs) face a sharp increase in computational complexity and memory consumption when processing high-resolution images. The method proposed in the paper addresses this challenge through causal modeling with linear complexity.
2. **Information imbalance problem**: In a unidirectional causal model, tokens at the beginning of the sequence have difficulty accessing the global context, resulting in poor representation quality. To alleviate this, the paper introduces two simple mechanisms, the global average pooling token and the inter-layer flipping operation.
3. **Improving model efficiency**: The paper shows that causal modeling can significantly reduce redundant computation while maintaining accuracy comparable to that of the standard ViT. In addition, combining it with RNN-like token mixers (such as Mamba) further improves computational and memory efficiency.

### Solution overview

- **Causal modeling framework**: Divide the image into non-overlapping patches to form a one-dimensional token sequence and process it with a unidirectional language model. In this way, the model maintains linear complexity when processing long sequences.
- **Global average pooling token**: Place a global average pooling token at the beginning of the input sequence of each Adventurer layer. This token contains the average value of all other tokens, thereby providing global information to the tokens at the beginning of the sequence.
- **Inter-layer flipping operation**: Flip the order of patch tokens between every two Adventurer blocks to offset the local information imbalance caused by position differences.

### Experimental results

Experiments show that the Adventurer model achieves competitive performance on the ImageNet benchmark. For example, the Base-sized model reaches a test accuracy of 84.0% at an input size of 448×448, and its training is 5.3 times more efficient than that of a ViT reaching the same result. In addition, the Adventurer model shows a significant speed advantage when processing long sequences of more than 3,000 tokens.

### Summary

By introducing causal image modeling and two simple mechanisms, this paper addresses the computational and memory problems of high-resolution image processing while maintaining good visual understanding performance. This provides new ideas and methods for the design of more efficient visual models in the future.
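The two mechanisms from the solution overview can be illustrated with a small pure-Python sketch. This is not the authors' implementation: `toy_causal_mixer` is a hypothetical running-mean stand-in for a real uni-directional mixer such as Mamba, and dropping the pooled slot after mixing is an assumption made only to keep the example short.

```python
from typing import List

Token = List[float]  # one patch embedding


def mean_token(tokens: List[Token]) -> Token:
    """Element-wise average of all tokens (the global average pooling token)."""
    n = len(tokens)
    return [sum(t[d] for t in tokens) / n for d in range(len(tokens[0]))]


def toy_causal_mixer(seq: List[Token]) -> List[Token]:
    """Hypothetical stand-in for a uni-directional mixer.

    Output i is the running mean of inputs 0..i, so each position only sees
    the past, and the recurrent running sum costs O(len(seq)).
    """
    running = [0.0] * len(seq[0])
    out = []
    for i, tok in enumerate(seq, start=1):
        running = [r + x for r, x in zip(running, tok)]
        out.append([r / i for r in running])
    return out


def toy_layer(patches: List[Token]) -> List[Token]:
    """One illustrative layer: prepend pooled token, mix causally, flip."""
    seq = [mean_token(patches)] + patches  # pooled token goes first
    mixed = toy_causal_mixer(seq)
    mixed_patches = mixed[1:]              # assumption: drop the pooled slot
    return mixed_patches[::-1]             # reversed order for the next layer


patches = [[1.0], [2.0], [3.0], [6.0]]
print(toy_causal_mixer(patches)[0])  # first patch alone sees only itself
print(toy_layer(patches))            # with pooled token, then flipped
```

In this toy setup the first patch without the pooled token can only reproduce its own value, whereas with the pooled token in front its output already mixes in the sequence average. The flip then moves the last patch to the front, so the next layer reads the sequence in the opposite direction.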

Causal Image Modeling for Efficient Visual Understanding
