Abstract: Workplace accidents due to personal protective equipment (PPE) non-compliance raise serious safety concerns and lead to legal liabilities, financial penalties, and reputational damage. While object detection models have shown the capability to address this issue by identifying safety items, most existing models, such as YOLO, Faster R-CNN, and SSD, are limited in verifying the fine-grained attributes of PPE across diverse workplace scenarios. Vision language models (VLMs) are gaining traction for detection tasks by leveraging the synergy between visual and textual information, offering a promising solution to traditional object detection limitations in PPE recognition. Nonetheless, VLMs face challenges in consistently verifying PPE attributes due to the complexity and variability of workplace environments, requiring them to interpret context-specific language and visual cues simultaneously. We introduce Clip2Safety, an interpretable detection framework for diverse workplace safety compliance, which comprises four main modules: scene recognition, the visual prompt, safety items detection, and fine-grained verification. The scene recognition module identifies the current scenario to determine the necessary safety gear. The visual prompt module formulates the specific visual prompts needed for the detection process. The safety items detection module identifies whether the required safety gear is being worn according to the specified scenario. Lastly, the fine-grained verification module assesses whether the worn safety equipment meets the fine-grained attribute requirements. We conduct real-world case studies across six different scenarios. The results show that Clip2Safety not only demonstrates an accuracy improvement over state-of-the-art question-answering based VLMs but also achieves inference times two hundred times faster.
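The four-stage flow described in the abstract can be sketched as plain control flow. This is a minimal illustration only, not the authors' implementation: the rule table `SCENE_REQUIREMENTS`, the `ComplianceReport` type, and the `check_compliance` function are hypothetical names, and the scene label and detected attributes are passed in directly instead of being predicted by a vision model.

```python
from dataclasses import dataclass, field

# Hypothetical rule table: scene -> required item -> required attribute.
# The entries are illustrative and are not taken from the paper.
SCENE_REQUIREMENTS = {
    "chemical_lab": {"gloves": "chemical-resistant", "goggles": "sealed"},
    "construction_site": {"helmet": "hard shell", "vest": "high-visibility"},
}


@dataclass
class ComplianceReport:
    scene: str
    missing: list = field(default_factory=list)
    wrong_attribute: list = field(default_factory=list)

    @property
    def compliant(self):
        # Compliant only if every required item is worn with the right attribute.
        return not self.missing and not self.wrong_attribute


def check_compliance(scene, detected):
    """Check worn items against the scene's requirements.

    `scene` stands in for the scene recognition output, and `detected`
    maps each worn item to its observed attribute (stand-ins for the
    detection and verification model outputs).
    """
    required = SCENE_REQUIREMENTS[scene]        # scene decides what is required
    report = ComplianceReport(scene=scene)
    for item, needed_attr in required.items():  # one check per required item
        if item not in detected:                # item-level detection
            report.missing.append(item)
        elif detected[item] != needed_attr:     # fine-grained attribute check
            report.wrong_attribute.append(item)
    return report


report = check_compliance("chemical_lab", {"gloves": "cotton"})
print(report.missing, report.wrong_attribute, report.compliant)
```

Keeping the missing-item and wrong-attribute lists separate is one way to make the result interpretable: the report says which requirement failed and why, not only that the check failed.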

What problem does this paper attempt to address?

The problem this paper attempts to solve is: **how to achieve fine-grained and interpretable personal protective equipment (PPE) compliance detection in diverse workplace environments**. Specifically, although existing object detection models (such as YOLO, Faster R-CNN, and SSD) can recognize safety items to a certain extent, they are limited in verifying the fine-grained attributes of PPE. These models struggle with complex and variable working environments and cannot accurately interpret scenario-specific language and visual cues. A more flexible framework is therefore needed to verify the correct use and compliance of PPE across different workplaces.

To solve this problem, the paper proposes a framework named **Clip2Safety**, which contains four main modules:

1. **Scene Recognition Module**: Recognizes the current scene to determine the required protective equipment.
2. **Visual Prompt Module**: Generates the specific visual prompts needed for detection.
3. **Safety Item Detection Module**: Determines whether the required safety equipment is being worn.
4. **Fine-grained Verification Module**: Evaluates whether the worn safety equipment meets the fine-grained attribute requirements.

Working together, these four modules let Clip2Safety improve detection accuracy while cutting inference time substantially (two orders of magnitude faster than existing state-of-the-art question-answering-based VLMs). The framework also adapts to multiple working scenarios and provides detailed compliance evaluations, thereby improving workplace safety.

### Main Contributions

- Proposed the Clip2Safety framework, which leverages vision language models (VLMs) and object detection models to support a calibrated visual-text embedding space.
- Designed the scene recognition module to match scene-specific requirements with visual prompts.
- Introduced the visual prompt module to improve the model's adaptability to different safety compliance requirements.
- Conducted experiments on real-world datasets covering six different working scenarios, demonstrating high efficiency and effectiveness.

### Key Problems Solved

1. **Complex and Diverse Working Environments**: Different workplaces have different PPE requirements. Clip2Safety handles this through the scene recognition module and the visual prompt module.
2. **Fine-grained Attribute Verification**: Detecting the presence of PPE is not enough; its specific attributes must also be verified, such as whether gloves are chemical-resistant.
3. **Data Scarcity and Imbalance**: A zero-shot learning framework and pre-trained models reduce the dependence on large amounts of labeled data.

In summary, this paper aims to address the deficiencies of existing models in PPE compliance detection across diverse workplace environments by introducing the Clip2Safety framework, thereby improving workplace efficiency and safety.
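To make the idea of zero-shot matching in a shared visual-text embedding space concrete, here is a small sketch assuming a CLIP-style setup in which images and text prompts are encoded into the same vector space and compared by cosine similarity. The three-dimensional vectors, the prompt strings, and the helper names `cosine` and `zero_shot_label` are hypothetical stand-ins; a real system would obtain the vectors from pre-trained image and text encoders, and the paper's actual procedure may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def zero_shot_label(image_vec, text_vecs):
    """Return the best-matching prompt and all similarity scores.

    No labeled training images are needed: candidate classes are
    described only by their text prompts.
    """
    scores = {label: cosine(image_vec, vec) for label, vec in text_vecs.items()}
    return max(scores, key=scores.get), scores


# Toy embeddings standing in for text-encoder outputs (assumed values).
TEXT_VECS = {
    "a photo of a chemistry laboratory": [0.9, 0.1, 0.0],
    "a photo of a construction site": [0.1, 0.9, 0.1],
    "a photo of a hospital ward": [0.0, 0.2, 0.9],
}

# Toy embedding standing in for an image-encoder output (assumed values).
image_vec = [0.2, 0.8, 0.1]

label, scores = zero_shot_label(image_vec, TEXT_VECS)
print(label)
```

Under these assumptions, supporting a new scene or a new attribute means adding a text prompt, not collecting and labeling more images, which is how this kind of approach reduces dependence on labeled data.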

Vision Language Model for Interpretable and Fine-grained Detection of Safety Compliance in Diverse Workplaces
