Abstract: Construction workplace hazard detection requires engineers to analyze scenes manually against many safety rules, which is time-consuming, labor-intensive, and error-prone. Computer vision algorithms have yet to reliably discriminate between the anomalous and benign object relations that underpin safety violation detection. Recently developed deep-learning-based computer vision algorithms need tens of thousands of images, labeled with the safety rules violated, to train networks that acquire spatiotemporal reasoning capacity in complex workplaces. Such training requires human experts to label images and indicate whether the relationships among the workers, resources, and equipment in a scene violate spatiotemporal arrangement rules for safe and productive operations. False alarms in those manual labels (labeling no-violation images as having violations) can significantly mislead the machine learning process and result in computer vision models that produce inaccurate hazard detections. Compared with false alarms, the other type of mislabel, false negatives (labeling images that have violations as "no violation"), seems to have less impact on the reliability of the trained computer vision models. This paper examines a new crowdsourcing approach that achieves above 95% accuracy in labeling images of complex construction scenes having safety-rule violations, with a focus on minimizing false alarms while keeping acceptable rates of false negatives. The development and testing of this new crowdsourcing approach address two fundamental questions: (1) How can the impact of a short safety-rule training process on the labeling accuracy of non-professional image annotators be characterized? (2) How should the image labels contributed by ordinary people be aggregated to filter out false alarms while keeping an acceptable false negative rate?
In designing short training sessions for online image annotators, the research team split a large number of safety rules into smaller sets of six. Each online image annotator learns six randomly assigned safety rules and then labels workplace images as "no violation" or "violation" of specific rules among the six learned. About one hundred and twenty anonymous image annotators participated in the data collection. Finally, a Bayesian-network-based crowd consensus model aggregated the annotators' labels to obtain safety-rule violation labeling results. Experimental results show that the proposed model can achieve close to 0% false alarm rates while keeping the false negative rate below 10%. This labeling performance exceeds that of existing crowdsourcing approaches that use majority votes to aggregate crowdsourced labels. Given these findings, the presented crowdsourcing approach sheds light on effective construction safety surveillance by integrating human risk recognition capabilities into advanced computer vision.
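
To make the contrast between majority voting and probabilistic aggregation concrete, the following is a minimal sketch, not the paper's Bayesian-network consensus model. It assumes independent annotators who share the same error rates (a naive Bayes simplification), and the values of `prior`, `false_alarm`, `false_negative`, and `threshold` are hypothetical, chosen only for illustration. It shows how requiring a high posterior before declaring a violation can suppress false alarms that a simple majority would let through.

```python
from math import exp, log


def majority_vote(labels):
    """Return True (violation) if more than half of the labels say violation."""
    return sum(labels) > len(labels) / 2


def posterior_violation(labels, prior=0.5, false_alarm=0.2, false_negative=0.3):
    """Posterior probability of a violation given independent annotator labels.

    All rates are hypothetical placeholders, not values from the paper:
      false_alarm    = P(annotator says violation | no violation)
      false_negative = P(annotator says no violation | violation)
    """
    log_violation = log(prior)
    log_no_violation = log(1 - prior)
    for says_violation in labels:
        if says_violation:
            log_violation += log(1 - false_negative)
            log_no_violation += log(false_alarm)
        else:
            log_violation += log(false_negative)
            log_no_violation += log(1 - false_alarm)
    # Normalize the two hypotheses in log space for numerical stability.
    return 1 / (1 + exp(log_no_violation - log_violation))


def aggregate(labels, threshold=0.95, **rates):
    """Declare a violation only when the posterior clears a strict threshold."""
    return posterior_violation(labels, **rates) >= threshold


# Three of five annotators report a violation.
split_votes = [True, True, True, False, False]
unanimous_votes = [True] * 5

print(majority_vote(split_votes))        # simple majority accepts the violation
print(aggregate(split_votes))            # strict posterior threshold rejects it
print(aggregate(unanimous_votes))        # unanimous agreement clears the threshold
```

Under these assumed rates, a 3-to-2 split yields a posterior of roughly 0.86, which is below the 0.95 threshold, so the image is not flagged. Raising the threshold trades a lower false alarm rate for a higher false negative rate, which is the trade-off the abstract describes.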

MLLM-as-a-Judge for Image Safety without Human Labeling

MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models

SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models

MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models

MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance

Safety of Multimodal Large Language Models on Images and Text

Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation

Multimodal Situational Safety

Unbridled Icarus: A Survey of the Potential Perils of Image Inputs in Multimodal Large Language Model Security

ICM-Assistant: Instruction-tuning Multimodal Large Language Models for Rule-based Explainable Image Content Moderation

Online Safety Analysis for LLMs: a Benchmark, an Assessment, and a Path Forward

VLSBench: Unveiling Visual Leakage in Multimodal Safety

ChineseSafe: A Chinese Benchmark for Evaluating Safety in Large Language Models

Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models

CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

Crowdsourced Reliable Labeling of Safety-Rule Violations on Images of Complex Construction Scenes for Advanced Vision-Based Workplace Safety

Using Multimodal Large Language Models (MLLMs) for Automated Detection of Traffic Safety-Critical Events

R-Judge: Benchmarking Safety Risk Awareness for LLM Agents

Advancing Content Moderation: Evaluating Large Language Models for Detecting Sensitive Content Across Text, Images, and Videos