Abstract: Visually impaired (VI) people around the world have difficulty socializing and traveling because of the limitations of traditional assistive tools. In recent years, practical assistance systems for scene text detection and recognition have allowed VI people to obtain text information from their surroundings. However, real-world scene text features complex backgrounds, low resolution, variable fonts, and irregular arrangement, all of which make robust scene text detection and recognition difficult. In this paper, we propose a scene text recognition system to help VI people. First, we propose a high-performance neural network to detect and track objects, which is applied to specific scenes to obtain regions of interest (ROIs). To achieve real-time detection, we build a lightweight deep neural network using depthwise separable convolutions, which enables the system to be integrated into mobile devices with limited computational resources. Second, we train the neural network on textural features to improve the precision of text detection. Our algorithm suppresses the effects of spatial transformations (including translation, scaling, rotation, and other geometric transformations) using spatial transformer networks. Open-source optical character recognition (OCR) is trained on individual scene texts to improve the accuracy of text recognition. The interactive system then conveys the number and distance of inbound buses to VI people. Finally, a comprehensive set of experiments on several benchmark datasets demonstrates that our algorithm achieves a strong trade-off between precision and resource usage.
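The saving from depthwise separable convolutions mentioned in the abstract can be illustrated by counting weights. The following is a minimal sketch, not the network described in the paper: the function names and the layer sizes (3x3 kernel, 128 input channels, 256 output channels) are assumptions chosen for illustration, and bias terms are ignored.

```python
def standard_conv_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights in a standard 2D convolution (bias ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def depthwise_separable_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights in a depthwise convolution followed by a 1x1 pointwise convolution."""
    depthwise = kernel_size * kernel_size * in_channels  # one k x k filter per input channel
    pointwise = in_channels * out_channels               # 1x1 convolution mixes the channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer sizes, chosen only for illustration.
    k, c_in, c_out = 3, 128, 256
    std = standard_conv_params(k, c_in, c_out)
    sep = depthwise_separable_params(k, c_in, c_out)
    print(f"standard: {std} weights")
    print(f"depthwise separable: {sep} weights")
    print(f"ratio: {sep / std:.3f}")
```

For any layer the ratio equals 1/out_channels + 1/kernel_size², so with 3x3 kernels the separable form needs roughly one ninth of the weights once the output channel count is large.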

Research on visual automatic speech recognition

Scene Text Detection and Recognition System for Visually Impaired People in Real World

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Multi-objects Real Time Recognition Based on Color Information

An Investigation into Audio–Visual Speech Recognition under a Realistic Home–TV Scenario

Research on Visual Speech Feature Extraction

Speaker-Independent Speech Emotion Recognition Based on CNN-BLSTM and Multiple SVMs

Real-Time facial expression recognition system based on HMM and feature point localization

Visual Features Extracting & Selecting For Lipreading

Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition

Audio-visual Keyword Spotting for Mandarin Based on Discriminative Local Spatial-Temporal Descriptors

Audio-visual multi-channel speech separation, dereverberation and recognition

Visual Information Assisted Mandarin Large Vocabulary Continuous Speech Recognition

Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition

Tri-Modal Speech Recognition for Noisy and Variable Lighting Conditions

Audio-Visual System for Robust Speaker Recognition

VHASR: A Multimodal Speech Recognition System With Vision Hotwords

Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

Deep Learning for Visual Speech Analysis: A Survey