Probing Vision and Language Models for Construction Waste Material Recognition

Ying Sun, Zhaolin Gu, Sean Bin Yang
DOI: https://doi.org/10.1016/j.autcon.2024.105629
IF: 10.3
2024-01-01
Automation in Construction
Abstract: Motivated by the critical role of automatic sorting in construction waste management, recent work has leveraged deep learning's ability to capture powerful features within unimodal recognition approaches. However, existing methods remain limited because they rely solely on image-based datasets, which restricts feature expression. To address this, this paper introduces the VL-CSW dataset, which covers both image and text modalities. The paper then proposes ConCLIP, a vision-and-language model tailored for CSW recognition. ConCLIP incorporates a pre-feature interaction network for enhanced modality-specific feature learning and combines a bidirectional contrastive training paradigm with supervised task training to optimize its performance across both modalities. Evaluation on the VL-CSW datasets demonstrates ConCLIP's superiority on the CSW material classification task, where it significantly outperforms strong baselines in most settings. Notably, ConCLIP achieves performance improvements of 1.83% and 3.41% over unimodal methods in the VL-Concrete and VL-Metal classification tasks, respectively, highlighting the efficacy of multimodality in enhancing automatic sorting system performance.
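The abstract does not give ConCLIP's loss functions, so the following is only a minimal sketch of what a bidirectional contrastive objective combined with a supervised classification term might look like. It assumes a CLIP-style symmetric InfoNCE loss over matched image/text embeddings plus a cross-entropy term; the function names, the temperature of 0.07, and the weighting factor `alpha` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def bidirectional_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric (image-to-text and text-to-image) contrastive loss.

    Assumption: row i of image_emb and row i of text_emb form a matched pair.
    The temperature value is a hypothetical default.
    """
    # L2-normalise so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature  # shape (batch, batch)
    idx = np.arange(logits.shape[0])

    # Image-to-text: each row should put its mass on the diagonal entry.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each column should put its mass on the diagonal entry.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


def cross_entropy(class_logits, labels):
    """Supervised classification loss over material classes."""
    idx = np.arange(class_logits.shape[0])
    return -log_softmax(class_logits, axis=1)[idx, labels].mean()


def total_loss(image_emb, text_emb, class_logits, labels, alpha=1.0):
    """Contrastive term plus weighted supervised term (alpha is hypothetical)."""
    contrastive = bidirectional_contrastive_loss(image_emb, text_emb)
    return contrastive + alpha * cross_entropy(class_logits, labels)
```

In this sketch, correctly paired embeddings yield a lower contrastive loss than mismatched ones, which is the signal that would pull the image and text representations of the same waste sample together during training.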