PD-APE: A Parallel Decoding Framework with Adaptive Position Encoding for 3D Visual Grounding

Chenshu Hou, Liang Peng, Xiaopei Wu, Xiaofei He, Wenxiao Wang
2024-09-02
Abstract: 3D visual grounding aims to identify objects in 3D point cloud scenes that match specific natural language descriptions. This requires the model to not only focus on the target object itself but also to consider the surrounding environment to determine whether the descriptions are met. Most previous works attempt to accomplish both tasks within the same module, which can easily lead to a distraction of attention. To this end, we propose PD-APE, a dual-branch decoding framework that separately decodes target object attributes and surrounding layouts. Specifically, in the target object branch, the decoder processes text tokens that describe features of the target object (e.g., category and color), guiding the queries to pay attention to the target object itself. In the surrounding branch, the queries align with other text tokens that carry surrounding environment information, making the attention maps accurately capture the layout described in the text. Benefiting from the proposed dual-branch design, the queries are allowed to focus on points relevant to each branch's specific objective. Moreover, we design an adaptive position encoding method for each branch respectively. In the target object branch, the position encoding relies on the relative positions between seed points and predicted 3D boxes. In the surrounding branch, the attention map is additionally guided by the confidence between visual and text features, enabling the queries to focus on points that have valuable layout information. Extensive experiments demonstrate that we surpass the state-of-the-art on two widely adopted 3D visual grounding datasets, ScanRefer and Nr3D.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses two main challenges in 3D visual grounding:

1. **Entanglement of target object features and spatial layout features**: In previous methods, the attribute features of the target object (such as shape and color) and the spatial layout features are intertwined. This makes it hard for the model to separate the two kinds of attention, which hurts localization accuracy. For example, when the model attends to the target object and its surroundings at the same time, its attention may spread to objects of the same category at other positions.
2. **Text information fails to effectively guide the visual cross-attention module**: Existing methods do not fully use text information to guide the attention of the query points when processing visual features. As a result, the query points learn only coarse features from neighboring points, and the model attends to redundant or irrelevant spatial layout information, which disperses attention further.

To solve these problems, the authors propose a new framework, PD-APE (Parallel Decoding Framework with Adaptive Position Encoding). It improves 3D visual grounding in two ways:

- **Two-branch decoder**: A parallel decoder with two branches. One branch decodes the features of the target object, and the other perceives the layout of the surrounding environment. This lets the model attend to the target object and to its surroundings separately, avoiding dispersed attention.
- **Adaptive position encoding**: Each branch gets its own adaptive position encoding. For the target object branch, the position encoding is based on the relative position between the seed points and the predicted 3D box. For the surrounding environment branch, the attention is additionally guided by the confidence between visual and text features, so that it better captures the spatial layout described in the text.

With these improvements, PD-APE outperforms existing state-of-the-art methods on two widely used 3D visual grounding datasets, ScanRefer and Nr3D.
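
The two adaptive position encodings can be illustrated with a small NumPy sketch. This is not the authors' implementation, and the summary gives no formulas, so every name and design choice here is an assumption made for illustration: the helper names (`box_relative_offsets`, `sinusoidal_encoding`, `confidence_guided_attention`), the axis-aligned box, the normalization by box half-extents, the sinusoidal embedding, and the use of log-confidence as an additive attention bias.

```python
import numpy as np

def box_relative_offsets(seeds, box_center, box_size):
    """Offsets of seed points from a predicted box centre (assumed form).

    seeds: (N, 3) seed point coordinates.
    box_center: (3,) centre of an axis-aligned predicted box.
    box_size: (3,) full extents of the box.
    Offsets are divided by the half-extents, so a seed whose three
    values all lie in [-1, 1] is inside the box.
    """
    half = np.maximum(box_size / 2.0, 1e-6)  # guard against zero-sized boxes
    return (seeds - box_center) / half

def sinusoidal_encoding(x, num_freqs=4):
    """Sinusoidal embedding per coordinate: (N, 3) -> (N, 3 * 2 * num_freqs)."""
    freqs = 2.0 ** np.arange(num_freqs)           # (F,)
    angles = x[..., None] * freqs                 # (N, 3, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(x.shape[0], -1)

def confidence_guided_attention(logits, confidence):
    """Attention weights biased by a per-point confidence (assumed form).

    logits: (Q, N) raw query-to-seed attention scores.
    confidence: (N,) visual-text confidence per seed point, in (0, 1].
    Adding log(confidence) before the softmax scales each point's
    unnormalized weight by its confidence.
    """
    biased = logits + np.log(np.clip(confidence, 1e-6, 1.0))
    biased -= biased.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(biased)
    return weights / weights.sum(axis=-1, keepdims=True)

# Target object branch: position encoding from seed-to-box offsets.
seeds = np.array([[1.0, 1.0, 1.0], [3.0, 1.0, 1.0]])
offsets = box_relative_offsets(seeds, np.array([1.0, 1.0, 1.0]), np.array([2.0, 2.0, 2.0]))
target_pos_enc = sinusoidal_encoding(offsets)

# Surrounding branch: attention steered toward text-relevant points.
attn = confidence_guided_attention(np.zeros((1, 2)), np.array([0.75, 0.25]))
```

In this sketch the target branch sees positions expressed relative to the current box prediction, so the encoding changes as the box is refined, while the surrounding branch down-weights points with low visual-text confidence. A real model would learn these encodings and apply them inside transformer decoder layers.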