LFSamba: Marry SAM with Mamba for Light Field Salient Object Detection

Zhengyi Liu, Longzhen Wang, Xianyong Fang, Zhengzheng Tu, Linbo Wang
2024-11-11
Abstract: A light field camera can reconstruct 3D scenes using captured multi-focus images that contain rich spatial geometric information, enhancing applications in stereoscopic photography, virtual reality, and robotic vision. In this work, a state-of-the-art salient object detection model for multi-focus light field images, called LFSamba, is introduced to emphasize four main insights: (a) Efficient feature extraction, where SAM is used to extract modality-aware discriminative features; (b) Inter-slice relation modeling, leveraging Mamba to capture long-range dependencies across multiple focal slices, thus extracting implicit depth cues; (c) Inter-modal relation modeling, utilizing Mamba to integrate all-focus and multi-focus images, enabling mutual enhancement; (d) Weakly supervised learning capability, developing a scribble annotation dataset from an existing pixel-level mask dataset, establishing the first scribble-supervised baseline for light field salient object detection. https://github.com/liuzywen/LFScribble
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses salient object detection in multi-focus light field images. The authors propose a model named LFSamba, which combines SAM (Segment Anything Model) with Mamba, and target four specific problems:

1. **Efficient Feature Extraction**: Multi-focus light field images contain rich spatial geometric information, but extracting it efficiently is a challenge. The authors use SAM to extract modality-aware discriminative features.
2. **Inter-slice Relationship Modeling**: Multi-focus images consist of multiple focal slices, each focused at a different depth. To capture long-range dependencies across these slices and extract implicit depth cues, the authors introduce Mamba.
3. **Cross-modal Relationship Modeling**: To fuse the all-focus image with the multi-focus images, the authors design a cross-modal Mamba module so that the features of the two modalities enhance each other.
4. **Weakly Supervised Learning**: Annotations supervise the mapping a deep learning model learns from input to output. Existing methods usually require dense pixel-level annotation, which is labor-intensive. The authors derive a scribble annotation dataset from an existing pixel-level mask dataset and establish a scribble-supervised baseline, reducing annotation cost.

In summary, LFSamba addresses these problems as follows:

- Use SAM for efficient feature extraction.
- Use Mamba to capture long-range dependencies across the focal slices of multi-focus images.
- Design a cross-modal Mamba module to fuse features of the all-focus and multi-focus modalities.
- Construct a scribble-annotated dataset and adopt weakly supervised learning to reduce the annotation workload.

With these components, the authors present LFSamba as a state-of-the-art model for salient object detection in multi-focus light field images.
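
The inter-slice modeling idea above (treating the focal slices as a sequence and carrying information along it) can be illustrated with a very small state space recurrence. This is a sketch under stated assumptions, not the authors' implementation: it uses fixed scalar parameters (`decay`, `b`, `c`, all hypothetical names), whereas Mamba makes its parameters input-dependent, and it represents each slice as a plain feature vector rather than a SAM feature map.

```python
def scan_focal_slices(slices, decay=0.5, b=1.0, c=1.0):
    """Run a per-channel linear recurrence along the focal-slice axis.

    slices: list of N feature vectors (lists of floats), ordered by focal depth.
    Returns N output vectors; output n depends only on slices 0..n.

    Recurrence per channel (fixed parameters, an illustrative simplification):
        h_n = decay * h_{n-1} + b * x_n
        y_n = c * h_n
    """
    dim = len(slices[0])
    state = [0.0] * dim  # hidden state carried from slice to slice
    outputs = []
    for x in slices:
        # Blend the accumulated state from earlier slices with the current slice.
        state = [decay * h + b * xi for h, xi in zip(state, x)]
        outputs.append([c * h for h in state])
    return outputs


# Three identical slices: the state accumulates evidence across the sequence.
demo = scan_focal_slices([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]])
print(demo)  # → [[1.0, 2.0], [1.5, 3.0], [1.75, 3.5]]
```

Because the state is updated once per slice, the cost grows linearly with the number of focal slices, which is the property that makes this family of models attractive for long sequences.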