Evaluating Cascaded Methods of Vision-Language Models for Zero-Shot Detection and Association of Hardhats for Increased Construction Safety

Lucas Choi, Ross Greer
2024-10-16
Abstract: This paper evaluates the use of vision-language models (VLMs) for zero-shot detection and association of hardhats to enhance construction safety. Given the significant risk of head injuries in construction, proper enforcement of hardhat use is critical. We investigate the applicability of foundation models, specifically OWLv2, for detecting hardhats in real-world construction site images. Our contributions include the creation of a new benchmark dataset, Hardhat Safety Detection Dataset, by filtering and combining existing datasets and the development of a cascaded detection approach. Experimental results on 5,210 images demonstrate that the OWLv2 model achieves an average precision of 0.6493 for hardhat detection. We further analyze the limitations and potential improvements for real-world applications, highlighting the strengths and weaknesses of current foundation models in safety perception domains.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to address?

This paper evaluates vision-language models (VLMs) for zero-shot detection and association of hardhats, with the goal of improving safety on construction sites. Specifically, the researchers use foundation models such as OWLv2 to detect whether workers on construction sites are wearing hardhats, thereby reducing the risk of head injuries.

#### Main problems:

1. **Necessity of hardhat detection**: Many casualties occur on construction sites every year because workers are not wearing hardhats. Regulations require hardhats, but enforcement in practice falls short, so a technical means of verifying that workers wear them correctly is needed.
2. **Feasibility of zero-shot learning**: Traditional object detection methods rely on large amounts of manually labeled data, which is time-consuming and costly to produce. Zero-shot detection with vision-language models can detect objects without category-specific labeled training data, which offers a new approach to real-world safety perception problems.
3. **Limitations of existing datasets**: Existing datasets are labeled inconsistently and incompletely, which makes measured detection performance unreliable. A new benchmark dataset is therefore needed to evaluate these models properly.

#### Research contributions:

1. **Creation of a new dataset**: The researchers created a new benchmark dataset, the "Hardhat Safety Detection Dataset", by filtering and combining existing datasets to resolve the inconsistent labeling.
2. **Development of a cascaded detection method**: The proposed method first detects people in the image, then progressively narrows the search region to detect the head and the hardhat. This automatically associates a high-level class (such as a person) with its low-level attributes (such as the head and hardhat).
3. **Experimental verification and analysis**: Experiments on 5,210 images measure the performance of the OWLv2 model on hardhat detection and analyze its strengths and limitations, providing a basis for further improvement.

#### Summary:

The main goal of this paper is to evaluate the applicability of vision-language models to zero-shot detection and association of hardhats, especially in high-risk environments such as construction sites. By creating a new dataset and developing a cascaded detection method, the researchers aim to improve the accuracy and reliability of hardhat detection in practice and thereby reduce work-related accidents.
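
The cascaded detection method described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Detector` interface, the `cascade` function, the 0.3 score threshold, and the choice to search for the hardhat inside the head box (falling back to the person box) are all assumptions made for illustration. A canned `stub_detect` function stands in for a real zero-shot detector such as OWLv2 so that the association logic can be run on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels

# Hypothetical detector interface (an assumption, not a real library API):
# given a region of the image and a text prompt, return (box, score) pairs
# whose coordinates are relative to the top-left corner of that region.
Detector = Callable[[Box, str], List[Tuple[Box, float]]]


@dataclass
class PersonResult:
    person: Box
    head: Optional[Box]
    hardhat: Optional[Box]

    @property
    def wearing_hardhat(self) -> bool:
        return self.hardhat is not None


def to_image_coords(region: Box, local: Box) -> Box:
    """Shift a box found inside `region` back into full-image coordinates."""
    ox, oy = region[0], region[1]
    return (local[0] + ox, local[1] + oy, local[2] + ox, local[3] + oy)


def best_detection(detect: Detector, region: Box, prompt: str,
                   threshold: float) -> Optional[Box]:
    """Return the highest-scoring box for `prompt` inside `region`, if any."""
    hits = [(box, score) for box, score in detect(region, prompt)
            if score >= threshold]
    if not hits:
        return None
    box, _ = max(hits, key=lambda hit: hit[1])
    return to_image_coords(region, box)


def cascade(detect: Detector, image_box: Box,
            threshold: float = 0.3) -> List[PersonResult]:
    """Detect people, then search each person for a head and a hardhat."""
    results = []
    for box, score in detect(image_box, "person"):
        if score < threshold:
            continue
        person = to_image_coords(image_box, box)
        head = best_detection(detect, person, "head", threshold)
        # Assumption: narrow the hardhat search to the head box when one was
        # found, otherwise fall back to the whole person box.
        search_region = head if head is not None else person
        hardhat = best_detection(detect, search_region, "hardhat", threshold)
        results.append(PersonResult(person, head, hardhat))
    return results


def stub_detect(region: Box, prompt: str) -> List[Tuple[Box, float]]:
    """Canned detections standing in for a real vision-language model."""
    if prompt == "person" and region == (0, 0, 200, 200):
        return [((10, 20, 90, 180), 0.9), ((110, 20, 190, 180), 0.8)]
    if prompt == "head":
        return [((20, 0, 60, 40), 0.7)]
    if prompt == "hardhat" and region[0] < 100:
        return [((0, 0, 40, 20), 0.6)]  # only the left-hand worker has one
    return []


if __name__ == "__main__":
    for result in cascade(stub_detect, (0, 0, 200, 200)):
        print(result.person, "wearing hardhat:", result.wearing_hardhat)
```

Because every head and hardhat box is found inside a specific person's region, the association between a worker and their hardhat falls out of the search order rather than requiring a separate matching step.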