Abstract: Recently, scaling images to high resolution has received much attention in multimodal large language models (MLLMs). Most existing practices adopt a sliding-window-style cropping strategy to adapt to the resolution increase. Such a cropping strategy, however, can easily cut off objects and connected regions, which introduces semantic discontinuity and therefore impedes MLLMs from recognizing small or irregularly shaped objects or text, leading to a phenomenon we call the semantic sawtooth effect. This effect is particularly evident in lightweight MLLMs. To address this issue, we introduce a Complementary Image Pyramid (CIP), a simple, effective, and plug-and-play solution designed to mitigate semantic discontinuity during high-resolution image processing. In particular, CIP dynamically constructs an image pyramid to provide complementary semantic information for cropping-based MLLMs, enabling them to acquire rich semantics at all levels. Furthermore, we introduce a Scale Compression Mechanism (SCM) to reduce the additional computational overhead by compressing redundant visual tokens. Our experiments demonstrate that CIP consistently enhances performance across diverse architectures (e.g., MiniCPM-V-2, InternVL2, and LLaVA-OneVision), various model capacities (1B$\rightarrow$8B), and different usage configurations (training-free and fine-tuning). Leveraging the proposed CIP and SCM, we introduce a lightweight MLLM, Mini-Monkey, which achieves remarkable performance in both general multimodal understanding and document understanding. On OCRBench, the 2B version of Mini-Monkey even surpasses the 8B model InternVL2-8B by 12 points. Additionally, training Mini-Monkey is cheap, requiring only eight RTX 3090 GPUs. The code is available at https://github.com/Yuliang-Liu/Monkey.
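
To make the cropping issue concrete, the following is a minimal sketch, not the authors' implementation, of why tile grids with different layouts can complement each other. It assumes an image is split into equal tiles and models an object as a one-dimensional interval along one axis; the helper names (`cut_positions`, `shared_cuts`, `is_cut`) and the example grids (2x2 and 3x3) are hypothetical choices for illustration, and the abstract does not specify how CIP actually selects its pyramid levels.

```python
from fractions import Fraction


def cut_positions(n_tiles):
    """Interior cut positions, as fractions of the side length, for n equal tiles."""
    return {Fraction(i, n_tiles) for i in range(1, n_tiles)}


def shared_cuts(grid_a, grid_b):
    """Cut lines that two (rows, cols) grids have in common along each axis."""
    rows = cut_positions(grid_a[0]) & cut_positions(grid_b[0])
    cols = cut_positions(grid_a[1]) & cut_positions(grid_b[1])
    return rows, cols


def is_cut(interval, n_tiles):
    """True if an object spanning (start, end) within [0, 1] crosses a tile boundary."""
    start, end = interval
    return any(start < c < end for c in cut_positions(n_tiles))


# Hypothetical example grids; the real grid-selection rule is not given in the abstract.
detailed = (2, 2)
complementary = (3, 3)

# An object occupying the middle fifth of the image along one axis.
obj = (Fraction(2, 5), Fraction(3, 5))

print(shared_cuts(detailed, complementary))  # no boundary is shared by both grids
print(is_cut(obj, 2), is_cut(obj, 3))        # split by the 2-tile grid, intact in the 3-tile grid
```

In this toy setting, an object that straddles the boundary of one grid falls entirely inside a tile of the other, which is the intuition behind supplying crops at more than one layout.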

Mean Box Pooling: A Rich Image Representation and Output Embedding for the Visual Madlibs Task

Combining Multiple Cues for Visual Madlibs Question Answering

Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

Compact Tensor Pooling for Visual Question Answering

Wasserstein Pooling for Image Classification

Enhancing Sentence Embedding with Generalized Pooling

Cross-convolutional-layer Pooling for Generic Visual Recognition

Enhancing high-vocabulary image annotation with a novel attention-based pooling

Beyond Bilinear: Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering

Self-Attentive Pooling for Efficient Deep Learning

Deep CNNs Meet Global Covariance Pooling: Better Representation and Generalization

Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering

Object Level Deep Feature Pooling for Compact Image Representation

Mini-Monkey: Alleviating the Semantic Sawtooth Effect for Lightweight MLLMs via Complementary Image Pyramid

Two-stage Pooling of Deep Convolutional Features for Image Retrieval

InfMLLM: A Unified Framework for Visual-Language Tasks

Solving Visual Madlibs with Multiple Cues

Context-aware Attentional Pooling (CAP) for Fine-grained Visual Classification

CompCap: Improving Multimodal Large Language Models with Composite Captions

Combining Local and Global: Rich and Robust Feature Pooling for Visual Recognition

Efficient Large Multi-modal Models via Visual Context Compression