Abstract: Individuals with hearing impairments often rely on non-verbal communication, including facial expressions and gestures. Visual Speech Recognition (VSR) systems face challenges due to insufficient datasets and the difficulty of extracting subtle lip movements. In response, we propose a two-fold framework, BlidAVS10. First, we create a robust Arabic audio-visual dataset comprising 1,383 videos. Second, we introduce an approach to Arabic Audio-Visual Speech Recognition that uses BlidAVS10 to develop various VSR systems. BlidAVS10 includes four key services: (1) the creation of a comprehensive dataset through video generation, (2) the detection, tracking, and extraction of the mouth region within each video frame, (3) the selection and customization of VSR models by developers, and (4) the building, training, and evaluation of our Deep Learning (DL) models, namely a multi-layer Convolutional Neural Network (CNN) model and a Vision Transformer (ViT). Our extensive experiments on BlidAVS10 demonstrate the effectiveness and reliability of our recognition techniques under varying environmental conditions. The DL-based VSR systems trained on the dataset achieved an accuracy of nearly 98%. This work introduces BlidAVS10, a new audio-visual database, and offers a versatile framework with potential applications in assisting the hard of hearing, securing access through lipreading, enabling soundless communication with machines, and supporting the medical field in understanding the needs of laryngeal cancer patients.
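As an illustration of service (2), the extraction of the mouth region from each frame, the following is a minimal sketch and not the authors' implementation. It assumes that lip landmark coordinates are already available from some face-landmark detector; the function name `crop_mouth_roi` and the parameters `size` and `margin` are hypothetical choices made for this example.

```python
import numpy as np


def crop_mouth_roi(frame, mouth_points, size=64, margin=0.25):
    """Crop a square mouth region from one frame, resampled to size x size.

    frame        : H x W (grayscale) or H x W x C image array.
    mouth_points : (N, 2) array of (x, y) lip landmarks. These are assumed to
                   come from an external landmark detector (hypothetical input).
    size, margin : output side length in pixels and relative padding around
                   the lips (illustrative defaults, not values from the paper).
    """
    pts = np.asarray(mouth_points, dtype=float)
    cx, cy = pts.mean(axis=0)  # centre of the lip landmarks

    # Half side length of the square: larger landmark extent plus a margin.
    extent = max(np.ptp(pts[:, 0]), np.ptp(pts[:, 1]), 1.0)
    half = 0.5 * (1.0 + margin) * extent

    h, w = frame.shape[:2]
    # Nearest-neighbour sampling grid, clipped so it never leaves the frame.
    xs = np.clip(np.round(np.linspace(cx - half, cx + half, size)).astype(int), 0, w - 1)
    ys = np.clip(np.round(np.linspace(cy - half, cy + half, size)).astype(int), 0, h - 1)
    return frame[np.ix_(ys, xs)]
```

Applying this function to every frame of a video and stacking the results with `np.stack` would give a fixed-size clip of mouth crops, which is the kind of input a CNN or ViT classifier typically expects.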

Amazigh audiovisual speech recognition system design

An Arabic visual speech recognition framework with CNN and vision transformers for lipreading

Amazigh CNN speech recognition system based on Mel spectrogram feature extraction method

Enhancing Amazigh ASR through convolutional neural networks and MFCC

Object Recognition System for the Visually Impaired: A Deep Learning Approach using Arabic Annotation

Lip Localization and Viseme Classification for Visual Speech Recognition

Design and Implementation of a Real-Time Color Recognition System for the Visually Impaired

Design and implementation of smart voice assistant and recognizing academic words

Automatic Speech Recognition and its Visual Perception Via a Cymatics Based Display

Automatic lip-reading classification using deep learning approaches and optimized quaternion meixner moments by GWO algorithm

ViSpeR: Multilingual Audio-Visual Speech Recognition

Maghrebian dialect recognition based on support vector machines and neural network classifiers

A Performance Analysis of Face and Speech Recognition in the Video and Audio Stream Using Machine Learning Classification Techniques

Visual Passwords Using Automatic Lip Reading

Audio-Visual System for Robust Speaker Recognition

NF-SAVO: Neuro-Fuzzy system for Arabic Video OCR

Resource aware design of a deep convolutional-recurrent neural network for speech recognition through audio-visual sensor fusion

VoxArabica: A Robust Dialect-Aware Arabic Speech Recognition System

Visual Methods for Sign Language Recognition: A Modality-Based Review

A Vision System for Multi-View Face Recognition

Assisting Blind People Using Object Detection with Vocal Feedback