Abstract: Image manipulation can lead to misinterpretation of visual content, posing significant risks to information security. Image Manipulation Localization (IML) has thus received increasing attention. However, existing IML methods rely heavily on task-specific designs, so they perform well on only one target image type and are close to random guessing on other image types; even joint training on multiple image types causes significant performance degradation. This hinders deployment in real applications: it notably increases maintenance costs, and the misclassification of image types leads to serious error accumulation. To this end, we propose Omni-IML, the first generalist model to unify diverse IML tasks. Specifically, Omni-IML achieves generalism by adopting the Modal Gate Encoder and the Dynamic Weight Decoder to adaptively determine the optimal encoding modality and the optimal decoder filters for each sample. We additionally propose an Anomaly Enhancement module that enhances the features of tampered regions with box supervision and helps the generalist model extract common features across different IML tasks. We validate our approach on IML tasks across three major scenarios: natural images, document images, and face images. Without bells and whistles, our Omni-IML achieves state-of-the-art performance on all three tasks with a single unified model, providing valuable strategies and insights for real-world application and future research in generalist image forensics. Our code will be publicly available.

What problem does this paper attempt to address?

This paper attempts to solve an important problem in the field of Image Manipulation Localization (IML): existing IML methods perform well only on the image type they were designed for, and their performance drops significantly when they are trained jointly on multiple image types. Specifically, current IML models are usually designed for one specific type of image (such as natural images, document images, or face images). On image types they were not designed for, these models are almost equivalent to random guessing. A deployment therefore has to maintain several specialist models, which raises maintenance costs, and misclassifying the type of an input image leads to serious error accumulation.

To address this challenge, the authors propose Omni-IML, a general-purpose model aimed at handling multiple IML tasks in a unified way. Omni-IML achieves this goal through the following modules:

1. **Modal Gate Encoder**: Automatically selects the best encoding modality (frequency + visual or pure visual) for each input sample to adapt to the characteristics of different types of images.
2. **Anomaly Enhancement**: Enhances the features of the forged area by introducing bounding-box supervision, helping the model extract common features across different IML tasks.
3. **Dynamic Weight Decoder**: Adaptively selects the best decoder filters for each sample, reducing conflicts in unified training.

Through these designs, Omni-IML can achieve high-performance forgery localization simultaneously in three main scenarios: natural images, document images, and face images, without task-specific or benchmark-specific fine-tuning. Experimental results show that Omni-IML achieves state-of-the-art performance in all three tasks, significantly outperforming previous specialized methods for individual tasks.
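The per-sample modality choice described for the Modal Gate Encoder can be illustrated with a small sketch. This is not the authors' implementation: it assumes a hypothetical gate score in [0, 1] where a higher value means the pure visual encoding is preferred, a hard 0.5 threshold, and plain Python lists standing in for feature tensors. The names `route_by_gate` and `route_batch` are made up for illustration.

```python
from typing import List, Sequence


def route_by_gate(gate_prob_visual: float,
                  visual_feat: Sequence[float],
                  fused_feat: Sequence[float],
                  threshold: float = 0.5) -> List[float]:
    """Pick one encoding for a single sample (hard gate).

    gate_prob_visual: assumed gate output in [0, 1]; higher means the
    pure visual encoding is preferred over the frequency + visual one.
    """
    if not 0.0 <= gate_prob_visual <= 1.0:
        raise ValueError("gate_prob_visual must lie in [0, 1]")
    chosen = visual_feat if gate_prob_visual > threshold else fused_feat
    return list(chosen)


def route_batch(gate_probs: Sequence[float],
                visual_feats: Sequence[Sequence[float]],
                fused_feats: Sequence[Sequence[float]],
                threshold: float = 0.5) -> List[List[float]]:
    """Apply the hard gate independently to every sample in a batch."""
    return [
        route_by_gate(g, v, f, threshold)
        for g, v, f in zip(gate_probs, visual_feats, fused_feats)
    ]


if __name__ == "__main__":
    # Sample 0 is routed to the visual branch, sample 1 to the fused branch.
    print(route_batch([0.9, 0.2], [[1.0], [2.0]], [[10.0], [20.0]]))
```

The point of the sketch is only that the decision is made per sample rather than per dataset, so a mixed batch of natural, document, and face images does not need an external image-type classifier in front of the model.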
### Formula Summary

Some of the formulas involved in the paper are as follows:

- Loss function of the Modal Gate Encoder:

  \[ L_{MG} = CE(P_{rgb}, L_m) + CE(P_{fused}, L_m) + CE(P_{cls}, L_c) \]

  where \(L_c\) is defined as:

  \[ L_c = \begin{cases} 1 & \text{if } IoU(P_{rgb}, L_m) > IoU(P_{fused}, L_m) + 0.1 \\ 0 & \text{otherwise} \end{cases} \]

- Loss function of the Dynamic Weight Decoder:

  \[ L_{DWD} = CE(P_{DWD}, L_m) + CE(P_{co}, L_m) \]

These losses are the training objectives of the two modules under unified training across the different tasks.
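The case definition of \(L_c\) above can be read as a rule for labeling each training sample: the label is 1 only when the RGB-only prediction beats the fused prediction by more than the 0.1 IoU margin. A minimal sketch of that rule follows. It assumes that \(P_{rgb}\), \(P_{fused}\), and \(L_m\) are already binarized masks of the same shape (how predictions are binarized is not stated in the summary), and that two empty masks count as a perfect match. The function names `iou` and `gate_label` are hypothetical.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumption: if both masks are empty, treat them as identical.
    return inter / union if union else 1.0


def gate_label(p_rgb: Mask, p_fused: Mask, l_m: Mask,
               margin: float = 0.1) -> int:
    """L_c = 1 if IoU(P_rgb, L_m) > IoU(P_fused, L_m) + margin, else 0."""
    return 1 if iou(p_rgb, l_m) > iou(p_fused, l_m) + margin else 0


if __name__ == "__main__":
    l_m = [[1, 1], [0, 0]]      # mask label
    p_rgb = [[1, 1], [0, 0]]    # IoU with l_m: 2/2
    p_fused = [[1, 0], [0, 0]]  # IoU with l_m: 1/2
    print(gate_label(p_rgb, p_fused, l_m))
```

Because of the margin, ties and small advantages for the RGB-only branch still produce label 0, so the fused (frequency + visual) encoding acts as the default under this rule.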

Omni-IML: Towards Unified Image Manipulation Localization

Generic Image Manipulation Localization Through the Lens of Multi-scale Spatial Inconsistence

PROMPT-IML: Image Manipulation Localization with Pre-trained Foundation Models Through Prompt Tuning

OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision

Multi-view Feature Extraction Via Tunable Prompts is Enough for Image Manipulation Localization

GIM: A Million-scale Benchmark for Generative Image Manipulation Detection and Localization

Image-to-Image Matching via Foundation Models: A New Perspective for Open-Vocabulary Semantic Segmentation

IML-ViT: Benchmarking Image Manipulation Localization by Vision Transformer

OmniGen: Unified Image Generation

Cross-Modal Omni Interaction Modeling for Phrase Grounding

OmniGuard: Hybrid Manipulation Localization via Augmented Versatile Deep Image Watermarking

OmniGlue: Generalizable Feature Matching with Foundation Model Guidance

OmniBench: Towards The Future of Universal Omni-Language Models

OSMLoc: Single Image-Based Visual Localization in OpenStreetMap with Geometric and Semantic Guidances

OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction

Pixel-Inconsistency Modeling for Image Manipulation Localization

InfMLLM: A Unified Framework for Visual-Language Tasks.

EAN: Edge-Aware Network for Image Manipulation Localization

OmiQnet: Multiscale feature aggregation convolutional neural network for omnidirectional image assessment

Multi-modality boundary-guided network for generalizable image manipulation localization

OmniCLIP: Adapting CLIP for Video Recognition with Spatial-Temporal Omni-Scale Feature Learning