SAM-PM: Enhancing Video Camouflaged Object Detection using Spatio-Temporal Attention

Muhammad Nawfal Meeran, Gokul Adethya T, Bhanu Pratyush Mantha
2024-06-09
Abstract: In the domain of large foundation models, the Segment Anything Model (SAM) has gained notable recognition for its exceptional performance in image segmentation. However, the video camouflaged object detection (VCOD) task presents a unique challenge. Camouflaged objects typically blend into the background, making them difficult to distinguish in still images, and maintaining temporal consistency across frames is itself a hard problem. As a result, SAM falls short when applied to the VCOD task. To overcome these challenges, we propose a new method called the SAM Propagation Module (SAM-PM). Our propagation module enforces temporal consistency within SAM by employing spatio-temporal cross-attention mechanisms. Moreover, we train only the propagation module while keeping the SAM network weights frozen, allowing us to integrate task-specific insights with the vast knowledge accumulated by the large model. Our method incorporates temporal consistency and domain-specific expertise into the segmentation network while adding less than 1% of SAM's parameters. Extensive experimentation reveals a substantial performance improvement on the VCOD benchmark compared with the most recent state-of-the-art techniques. Code and pre-trained weights are open-sourced at https://github.com/SpiderNitt/SAM-PM
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the challenges of Video Camouflaged Object Detection (VCOD). It points out that although the Segment Anything Model (SAM) performs excellently in image segmentation, it has limitations when applied to video camouflaged object detection:

1. **Dataset Bias**: The visual datasets used to train SAM mainly contain objects with clear boundaries and under-represent camouflaged objects, whose boundaries are blurred and hard to distinguish.
2. **Static Image Training**: SAM is trained mainly on static images, so it performs poorly at capturing motion and maintaining temporal consistency between consecutive video frames.
3. **Background Fusion Problem**: Camouflaged objects are usually highly similar to the background, which leads to two fundamental problems:
   - The boundaries of the object blend seamlessly with the background and only become obvious when the object moves.
   - Objects usually have repetitive textures similar to their surroundings, which makes optical-flow-based methods prone to errors when estimating pixel motion.

To solve these problems, the paper proposes a new method, the SAM Propagation Module (SAM-PM). SAM-PM enhances the temporal consistency of SAM by introducing a spatio-temporal cross-attention mechanism. Only the propagation module is trained while the weights of the SAM network stay frozen, which integrates domain-specific knowledge into the segmentation network while increasing the number of parameters by less than 1%. Experimental results show that SAM-PM significantly outperforms the latest techniques on VCOD benchmarks.
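To make the spatio-temporal cross-attention idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`cross_attention`, `propagate`) are hypothetical, there are no learned projections, and the residual addition is an assumption chosen for illustration. It only shows the general pattern in which tokens of the current frame act as queries over a memory of tokens gathered from earlier frames, while the embeddings themselves (standing in for the frozen SAM features) are left unchanged.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    queries: token vectors of the current frame.
    keys, values: token vectors drawn from the memory (earlier frames).
    Returns one attended vector per query.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        # Similarity of this query to every memory token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the memory values.
        dim = len(values[0])
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(dim)])
    return out


def propagate(current_tokens, memory_frames):
    """Hypothetical propagation step (illustrative assumption).

    memory_frames is a list of frames, each a list of token vectors.
    Flattening it over time and space gives a spatio-temporal memory.
    """
    memory = [tok for frame in memory_frames for tok in frame]
    attended = cross_attention(current_tokens, memory, memory)
    # Residual connection: keep the (frozen) embedding, add temporal context.
    return [[c + a for c, a in zip(cur, att)]
            for cur, att in zip(current_tokens, attended)]


if __name__ == "__main__":
    current = [[1.0, 0.0], [0.0, 1.0]]          # two tokens, dimension 2
    memory = [[[1.0, 0.0]], [[0.0, 1.0]]]       # two earlier frames, one token each
    for token in propagate(current, memory):
        print(token)
```

In a real system the queries, keys, and values would pass through trainable projection layers, and only those layers would receive gradient updates; that is how a small module can add temporal context without modifying the large pretrained network.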