ForgeryTTT: Zero-Shot Image Manipulation Localization with Test-Time Training

Weihuang Liu,Xi Shen,Chi-Man Pun,Xiaodong Cun
2024-10-05
Abstract: Social media is increasingly plagued by realistic fake images, making it hard to trust content. Previous algorithms to detect these fakes often fail in new, real-world scenarios because they are trained on specific datasets. To address the problem, we introduce ForgeryTTT, the first method leveraging test-time training (TTT) to identify manipulated regions in images. The proposed approach fine-tunes the model for each individual test sample, improving its performance. ForgeryTTT first employs vision transformers as a shared image encoder to learn both classification and localization tasks simultaneously during the training-time training using a large synthetic dataset. Precisely, the localization head predicts a mask to highlight manipulated areas. Given such a mask, the input tokens can be divided into manipulated and genuine groups, which are then fed into the classification head to distinguish between manipulated and genuine parts. During test-time training, the predicted mask from the localization head is used for the classification head to update the image encoder for better adaptation. Additionally, using the classical dropout strategy in each token group significantly improves performance and efficiency. We test ForgeryTTT on five standard benchmarks. Despite its simplicity, ForgeryTTT achieves a 20.1% improvement in localization accuracy compared to other zero-shot methods and a 4.3% improvement over non-zero-shot techniques. Our code and data will be released upon publication.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper addresses is: **how to accurately localize the manipulated regions of an image in the zero-shot setting, given that existing methods struggle to adapt to new scenarios as forgery techniques keep evolving**.

### Problem Background

With the rapid development of image-editing techniques, forged images are becoming increasingly realistic, and social media is filling up with fake content that is hard to tell apart from genuine images. Existing detection algorithms are usually trained on specific datasets and therefore often perform poorly in new, real-world scenarios. Specifically:

- **Existing Challenges**:
  - Existing methods generalize poorly to new scenarios because they are trained on specific datasets.
  - New forgery techniques (such as generative adversarial networks) make it harder to collect a comprehensive set of forged samples.
  - Existing image manipulation localization algorithms struggle to keep up with ever-evolving forgery techniques.

### Solutions Proposed in the Paper

To address these challenges, the paper proposes a new method named **ForgeryTTT**, which uses test-time training (TTT) to improve image manipulation localization. Specifically:

- **Main Innovations**:
  - **First use of TTT for image manipulation localization**: By fine-tuning the model on each individual test sample at test time, the model can better adapt to new scenarios.
  - **Multi-task framework**: Combines the manipulation classification and localization tasks, using a shared vision transformer as the image encoder.
  - **Self-supervised learning**: Divides the input tokens into manipulated and genuine groups using the predicted mask, then classifies them to update the image encoder.
  - **Random token dropping strategy**: Improves performance and efficiency by randomly dropping some tokens within each token group (manipulated and genuine).

### Method Overview

1. **Training Phase**:
   - Train the model on a large-scale synthetic dataset, learning the localization and classification tasks simultaneously.
   - The localization head predicts a mask of the manipulated areas, and the classification head distinguishes between manipulated and genuine parts.
2. **Test-Time Training Phase**:
   - Fine-tune the model on each test sample, using the mask predicted by the localization head to drive the classification head and update the encoder.
   - Construct pseudo-training samples by randomly dropping tokens to further optimize the model.

### Experimental Results

The experimental results show that ForgeryTTT outperforms other zero-shot and non-zero-shot methods on multiple benchmark datasets, especially in localization accuracy and adaptation to new forgery techniques. On five standard benchmarks, ForgeryTTT improves localization accuracy by 20.1% compared to other zero-shot methods and by 4.3% compared to non-zero-shot methods.

### Summary

The main contributions of the paper include:

- Proposing the first TTT framework specifically for zero-shot image manipulation localization.
- Designing a self-supervised task to enhance localization ability.
- Verifying the effectiveness of the method on multiple benchmark datasets, where it surpasses existing methods.

Through these innovations, the paper offers new ideas and techniques for key problems in image forgery detection.
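
The token grouping and random token dropping described under Method Overview can be illustrated with a minimal, library-free sketch. It is not the authors' implementation. The function names, the 0.5 mask threshold, the `keep_ratio` parameter, and the 1/0 pseudo-labels are all assumptions made for illustration, and tokens are treated as opaque objects rather than embedding vectors.

```python
import random
from typing import Any, List, Sequence, Tuple


def split_tokens_by_mask(
    tokens: Sequence[Any],
    mask_probs: Sequence[float],
    threshold: float = 0.5,  # assumed cut-off; not specified in the summary
) -> Tuple[List[Any], List[Any]]:
    """Split tokens into (manipulated, genuine) groups using a predicted mask."""
    if len(tokens) != len(mask_probs):
        raise ValueError("tokens and mask_probs must have the same length")
    manipulated = [t for t, p in zip(tokens, mask_probs) if p >= threshold]
    genuine = [t for t, p in zip(tokens, mask_probs) if p < threshold]
    return manipulated, genuine


def drop_tokens(group: Sequence[Any], keep_ratio: float, rng: random.Random) -> List[Any]:
    """Randomly keep a fraction of a group; keep at least one token if non-empty."""
    if not group:
        return []
    n_keep = max(1, int(len(group) * keep_ratio))
    return rng.sample(list(group), n_keep)


def build_pseudo_samples(
    tokens: Sequence[Any],
    mask_probs: Sequence[float],
    keep_ratio: float = 0.5,  # hypothetical value chosen for the example
    threshold: float = 0.5,
    seed: int = 0,
) -> List[Tuple[Any, int]]:
    """Build (token, pseudo_label) pairs: 1 = manipulated, 0 = genuine.

    In a test-time training loop, pairs like these would be fed to a
    classification head whose loss updates the shared encoder.
    """
    rng = random.Random(seed)
    manipulated, genuine = split_tokens_by_mask(tokens, mask_probs, threshold)
    samples = [(tok, 1) for tok in drop_tokens(manipulated, keep_ratio, rng)]
    samples += [(tok, 0) for tok in drop_tokens(genuine, keep_ratio, rng)]
    return samples


if __name__ == "__main__":
    toks = ["t0", "t1", "t2", "t3", "t4", "t5"]
    probs = [0.9, 0.8, 0.1, 0.2, 0.7, 0.3]
    print(build_pseudo_samples(toks, probs))
```

Calling `build_pseudo_samples` again with a different `seed` yields a different subset of the same two groups, which is one way to read "constructing pseudo-training samples" from a single test image. How the paper actually samples and weights these subsets is not stated in this summary.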