Abstract: This paper evaluates the use of vision-language models (VLMs) for zero-shot detection and association of hardhats to enhance construction safety. Given the significant risk of head injuries in construction, proper enforcement of hardhat use is critical. We investigate the applicability of foundation models, specifically OWLv2, for detecting hardhats in real-world construction site images. Our contributions include the creation of a new benchmark dataset, Hardhat Safety Detection Dataset, by filtering and combining existing datasets and the development of a cascaded detection approach. Experimental results on 5,210 images demonstrate that the OWLv2 model achieves an average precision of 0.6493 for hardhat detection. We further analyze the limitations and potential improvements for real-world applications, highlighting the strengths and weaknesses of current foundation models in safety perception domains.
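For readers unfamiliar with the metric, the sketch below shows one common way to compute average precision (AP) from ranked detections. It is an illustration only: the abstract does not state the matching rule or interpolation scheme behind the 0.6493 figure, so the all-point interpolation used here, the function name `average_precision`, and the assumption that each detection has already been marked as a true or false positive are all choices made for this example.

```python
from typing import List, Tuple


def average_precision(scored_hits: List[Tuple[float, bool]],
                      num_ground_truth: int) -> float:
    """Area under the precision-recall curve (all-point interpolation).

    scored_hits: (confidence, is_true_positive) per detection. Matching
    detections to ground truth (e.g. by IoU) is assumed to be done already.
    num_ground_truth: total number of annotated objects.
    """
    if num_ground_truth == 0:
        return 0.0

    # Rank detections from most to least confident.
    ranked = sorted(scored_hits, key=lambda pair: pair[0], reverse=True)

    true_positives = 0
    precisions, recalls = [], []
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        true_positives += int(is_tp)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)

    # Make precision non-increasing from left to right (the usual envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas between successive recall values.
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


# Toy example: three detections, two annotated hardhats.
toy_ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], 2)
```

In the toy example the ranked detections are a hit, a miss, and a hit, which gives an AP of 5/6 under this scheme.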

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper evaluates vision-language models (VLMs) for zero-shot detection and association of hardhats, with the goal of improving safety on construction sites. Specifically, the researchers use foundation models such as OWLv2 to detect whether workers on construction sites are wearing hardhats, thereby reducing the risk of head injuries.

#### Main problems:

1. **Necessity of hardhat detection**: Many casualties occur on construction sites every year because workers are not wearing hardhats. Regulations require hardhats, but enforcement in practice is unsatisfactory. Ensuring through technical means that workers wear hardhats correctly is therefore an urgent problem.
2. **Feasibility of zero-shot learning**: Traditional object detection methods rely on large amounts of manually labeled data, which is time-consuming and costly to produce. Zero-shot methods based on vision-language models can detect objects without category-specific labeled training data, which offers a new approach to real-world safety perception problems.
3. **Limitations of existing datasets**: Existing datasets suffer from inconsistent and incomplete labeling, which makes measured detection performance unreliable. A new benchmark dataset is therefore needed to evaluate these models properly.

#### Research contributions:

1. **Creation of a new dataset**: The researchers created a new benchmark dataset named "Hardhat Safety Detection Dataset", which addresses inconsistent labeling by filtering and combining existing datasets.
2. **Development of a cascaded detection method**: A cascaded detection method is proposed. People in the image are detected first, and the search is then progressively narrowed to detect the head and the hardhat. This method can automatically associate high-level classes (such as people) with their low-level attributes (such as head and hardhat).
3. **Experimental verification and analysis**: Experiments on 5,210 images measure the performance of the OWLv2 model in hardhat detection, and its strengths and limitations are analyzed as a basis for further improvement.

#### Summary:

The main goal of this paper is to evaluate the applicability of vision-language models to zero-shot detection and association of hardhats, especially in high-risk environments such as construction sites. By creating a new dataset and developing a cascaded detection method, the researchers aim to improve the accuracy and reliability of hardhat detection in practical applications, thereby reducing work-related accidents.
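The cascaded approach described above (person, then head, then hardhat) could be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Detector` interface, the `cascade` and `make_stub_detector` helpers, and the choice to search for the hardhat inside the head box are all assumptions made for this example. A real system would wrap an open-vocabulary detector such as OWLv2 behind the `Detector` callable; here a stub with hand-written boxes stands in so the association logic can run on its own.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), image coordinates


@dataclass
class Detection:
    label: str
    box: Box
    score: float


# Hypothetical interface: (image, text_query, search_region) -> detections
# found inside that region, with boxes in full-image coordinates.
Detector = Callable[[object, str, Box], List[Detection]]


def best_in_region(detect: Detector, image: object, query: str,
                   region: Box) -> Optional[Detection]:
    """Return the highest-scoring detection for `query` in `region`, if any."""
    return max(detect(image, query, region), key=lambda d: d.score, default=None)


def cascade(detect: Detector, image: object, full_frame: Box) -> List[Dict]:
    """Detect people, then narrow the search to each person's head and hardhat."""
    results = []
    for person in detect(image, "person", full_frame):
        head = best_in_region(detect, image, "head", person.box)
        # Assumption: the hardhat is searched for within the head region.
        hardhat = best_in_region(detect, image, "hardhat", head.box) if head else None
        results.append({"person": person, "head": head, "hardhat": hardhat,
                        "wearing_hardhat": hardhat is not None})
    return results


def make_stub_detector(scene: List[Detection]) -> Detector:
    """Stand-in detector: returns scene items whose label matches the query
    and whose box center lies inside the search region."""
    def detect(image: object, query: str, region: Box) -> List[Detection]:
        x0, y0, x1, y1 = region
        hits = []
        for d in scene:
            cx, cy = (d.box[0] + d.box[2]) / 2, (d.box[1] + d.box[3]) / 2
            if d.label == query and x0 <= cx <= x1 and y0 <= cy <= y1:
                hits.append(d)
        return hits
    return detect


# Two workers: the first wears a hardhat, the second does not.
scene = [
    Detection("person", (0, 0, 100, 300), 0.90),
    Detection("head", (30, 0, 70, 50), 0.80),
    Detection("hardhat", (28, 0, 72, 25), 0.70),
    Detection("person", (200, 0, 300, 300), 0.85),
    Detection("head", (230, 5, 270, 55), 0.75),
]
results = cascade(make_stub_detector(scene), image=None, full_frame=(0, 0, 400, 300))
```

Because each stage only searches inside the box found by the previous stage, every head and hardhat is tied to exactly one person without a separate matching step, which is the association property the summary attributes to the cascade.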

Evaluating Cascaded Methods of Vision-Language Models for Zero-Shot Detection and Association of Hardhats for Increased Construction Safety

Improved Vision-Based Method for Detection of Unauthorized Intrusion by Construction Sites Workers

Evaluating Vision-Language Models for Zero-Shot Detection, Classification, and Association of Motorcycles, Passengers, and Helmets

Toward Efficient Safety Helmet Detection Based on YoloV5 with Hierarchical Positive Sample Selection and Box Density Filtering

Automatic detection of hardhats worn by construction personnel: A deep learning approach and benchmark dataset

Helmet Wearing State Detection Based on Improved Yolov5s

Visual Detection of Personal Protective Equipment and Safety Gear on Industry Workers

Hardhat-Wearing Detection Based on a Lightweight Convolutional Neural Network with Multi-Scale Features and a Top-Down Module

Authentication control system for the efficient detection of hard-hats using deep learning algorithms

Hard Cases Detection in Motion Prediction by Vision-Language Foundation Models

SAFETY HELMET WEARING DETECTION BASED ON AN IMPROVED YOLOV3 SCHEME

GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection

CA-CentripetalNet: A novel anchor-free deep learning framework for hardhat wearing detection

Personal Protective Equipment Detection for Construction Workers: A Novel Dataset and Enhanced YOLOv5 Approach

Vision Language Model for Interpretable and Fine-grained Detection of Safety Compliance in Diverse Workplaces

Fast Personal Protective Equipment Detection for Real Construction Sites Using Deep Learning Approaches

YOLO-ESCA: A High-Performance Safety Helmet Standard Wearing Behavior Detection Model Based on Improved YOLOv5

Hardhat Detection Using IR and Depth Frames

Investigation Into Recognition Algorithm of Helmet Violation Based on YOLOv5-CBAM-DCN

Safety helmet detection based on improved YOLOv7-tiny with multiple feature enhancement

Revisiting Few-Shot Object Detection with Vision-Language Models