Abstract: Dual (or multiple) rear cameras on hand-held smartphones are believed to be the future of mobile photography. Recently, many such phones have been released, mainly with dual rear cameras: one wide-angle and one telephoto. Notable examples are the Apple iPhone 7 Plus and 8 Plus, iPhone X, Samsung Galaxy S9, LG V30, and Huawei Mate 10. With built-in dual-camera systems, these devices are capable not only of producing better-quality pictures but also of acquiring 3D stereo photos (with depth information collected). Thus, they can capture a moment in life with depth, just like our two-eye visual system. Thanks to this trend, these phones are becoming cheaper while also becoming more powerful. In this paper, we describe a system that uses commercial dual-rear-camera phones, such as the iPhone X, to provide aid for people who are visually impaired. We propose a design in which the phone is placed at the centre of the user's chest, and the user wears one or two Bluetooth headphones to listen to the phone's audio output. Our system consists of three modules: (1) scene context recognition to audio, (2) 3D stereo reconstruction to audio, and (3) interactive audio/voice controls. In slightly more detail, the wide-angle camera captures live photos, which are analysed by a GPS-guided deep learning process to describe the scene in front of the user (module 1). The telephoto camera captures a narrower field of view, which is stereo-reconstructed with the aid of the wide-angle image to form a depth map (a dense, area-based distance map). The map helps determine the distance to all visible objects so that the user can be notified of the critical ones (module 2). This module also makes the phone vibrate when an object is located close enough to the user, e.g. within hand-reach distance. The user can also query the system by asking various questions and receive automatic voice answers (module 3).
In addition, a manual rescue module (module 4) is included for cases where other things have gone wrong. An example of the vision-to-audio output could be "Overall, likely a corridor, one medium object is 0.5 m away - central left", or "Overall, city pathway, front cleared". An audio command input may be "read texts", upon which the phone will detect and read all text on the closest object. More details on the design and implementation are described in this paper.
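
As a rough illustration of how module 2 might turn a depth map into the kind of spoken message quoted above, the following minimal Python sketch scans a depth map for the nearest valid reading and formats a sentence. It is not the implementation described in the paper: the function names, the three-zone left/central/right split, the 0.7 m hand-reach threshold, and the 3.0 m "front cleared" cut-off are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

HAND_REACH_M = 0.7  # assumed hand-reach threshold in metres (not from the paper)
ZONES = ("left", "central", "right")  # assumed three-way horizontal split


def nearest_obstacle(depth_map: List[List[float]]) -> Optional[Tuple[float, str]]:
    """Return (distance_m, zone) of the closest valid reading, or None."""
    best: Optional[Tuple[float, str]] = None
    width = len(depth_map[0])
    for row in depth_map:
        for x, distance in enumerate(row):
            if distance <= 0:  # treat non-positive values as missing depth
                continue
            if best is None or distance < best[0]:
                zone = ZONES[min(x * 3 // width, 2)]
                best = (distance, zone)
    return best


def describe(depth_map: List[List[float]], scene_label: str,
             clear_beyond_m: float = 3.0) -> Tuple[str, bool]:
    """Build the spoken message and a flag saying whether to vibrate."""
    hit = nearest_obstacle(depth_map)
    if hit is None or hit[0] > clear_beyond_m:
        return f"Overall, {scene_label}, front cleared", False
    distance, zone = hit
    message = (f"Overall, {scene_label}, one object is "
               f"{distance:.1f} m away - {zone}")
    return message, distance <= HAND_REACH_M
```

In this sketch, `scene_label` stands in for the output of module 1, and the returned flag would be used to trigger the vibration. A real system would presumably group pixels into objects rather than react to a single nearest reading.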

Vision-referential speech enhancement of an audio signal using mask information captured as visual data

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

A Vision Aid for the Visually Impaired using Commodity Dual-Rear-Camera Smartphones

Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis

Visual Speech Enhancement

Visual Facial Enhancements Can Significantly Improve Speech Perception in the Presence of Noise

Audio-Visual Speech Enhancement With Selective Off-Screen Speech Extraction

Audio-Visual Target Speaker Enhancement on Multi-Talker Environment using Event-Driven Cameras

A Supervised Speech Enhancement Method for Smartphone-Based Binaural Hearing Aids

Visual Hallucination Elevates Speech Recognition

Real-Time Speech Enhancement for Mobile Communication Based on Dual-Channel Complex Spectral Mapping

Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

Speech Reconstruction With Reminiscent Sound Via Visual Voice Memory

Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues

A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model

SVoice: Enabling Voice Communication in Silence Via Acoustic Sensing on Commodity Devices.

Incorporating Visual Information Reconstruction into Progressive Learning for Optimizing audio-visual Speech Enhancement

Masking and Inpainting: A Two-Stage Speech Enhancement Approach for Low SNR and Non-Stationary Noise

Deep Learning Based Real-Time Speech Enhancement for Dual-Microphone Mobile Phones