Abstract: In recent years, several Japanese companies have attempted to improve the efficiency of their meetings, which has been a significant challenge. For instance, voice recognition technology is used to considerably improve meeting minutes creation. In an automatic minutes-creating system, identifying the speaker and adding speaker information to the text would substantially improve the overall efficiency of the process. Therefore, a few companies and research groups have proposed speaker estimation methods; however, these methods have drawbacks, such as requiring advance preparation, special equipment, and multiple microphones. These problems can be solved by using speech sections that are extracted from lip movements and voice information. When a person speaks, voice and lip movements occur simultaneously. Therefore, the speaker’s speech section can be extracted from videos by using lip movement and voice information. However, when the speech section is extracted from voice information alone, the voiceprint information of each meeting participant is required for speaker identification. When lip movements are used, the speech section and speaker position can be extracted without the voiceprint information. Therefore, in this study, we propose a speech-section extraction method that uses image and voice information in Japanese for speaker identification. The proposed method consists of three processes: i) the extraction of speech frames using lip movements, ii) the extraction of speech frames using voices, and iii) the classification of speech sections using these extraction results. We used video data to evaluate the functionality of the method. Further, the proposed method was compared with state-of-the-art techniques. The average F-measure of the proposed method was higher than that of the conventional methods based on state-of-the-art techniques. The evaluation results showed that the proposed method achieves state-of-the-art performance using a simpler process than the conventional method.
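
The three processes named in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the lip-based and voice-based frame results are combined or how sections are classified, so everything below is an assumption made for illustration: the per-frame boolean inputs, the rule that a frame counts as speech only when both cues agree, and the hypothetical function names `combine_frames`, `frames_to_sections`, and `f_measure`. It is not the authors' implementation.

```python
from typing import List, Tuple


def combine_frames(lip_active: List[bool], voice_active: List[bool]) -> List[bool]:
    """Mark a frame as speech only when both cues agree (an assumed AND rule)."""
    if len(lip_active) != len(voice_active):
        raise ValueError("both cue sequences must cover the same frames")
    return [lip and voice for lip, voice in zip(lip_active, voice_active)]


def frames_to_sections(speech: List[bool]) -> List[Tuple[int, int]]:
    """Group consecutive speech frames into (start, end) sections, end exclusive."""
    sections: List[Tuple[int, int]] = []
    start = None
    for i, flag in enumerate(speech):
        if flag and start is None:
            start = i  # a new section begins
        elif not flag and start is not None:
            sections.append((start, i))  # the current section ends
            start = None
    if start is not None:
        sections.append((start, len(speech)))  # section runs to the last frame
    return sections


def f_measure(predicted: List[bool], reference: List[bool]) -> float:
    """Frame-level F-measure: harmonic mean of precision and recall."""
    tp = sum(1 for p, r in zip(predicted, reference) if p and r)
    fp = sum(1 for p, r in zip(predicted, reference) if p and not r)
    fn = sum(1 for p, r in zip(predicted, reference) if not p and r)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-frame detections for one participant (not data from the study).
lip = [False, True, True, True, False, True, True, False]
voice = [False, False, True, True, True, True, True, True]
reference = [False, True, True, True, False, True, True, False]

combined = combine_frames(lip, voice)
sections = frames_to_sections(combined)
score = f_measure(combined, reference)
```

Because lip movement is observed per face, running such a combination separately for each participant would give speech sections that are already tied to a speaker position, which is the property the abstract highlights as removing the need for voiceprint information.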

Speaker-Independent English Consonant and Japanese Word Recognition by a Stochastic Dynamic Time Warping Method

Research on Speaker-Depended Isolated-Word Speech Recognition System

Simplified Deformation Compensation for Emotional Speaker Recognition

Speech recognition using Dynamic Time Warping (DTW)

Application of dynamic time warping optimization algorithm in speech recognition of machine translation

Wavoice: A mmWave-assisted Noise-resistant Speech Recognition System

A Large-Vocabulary Chinese Speech Recognition System

Wavoice: an Mmwave-Assisted Noise-Resistant Speech Recognition System

Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques

Wavelet-Based Mel-Frequency Cepstral Coefficients for Speaker Identification using Hidden Markov Models

Spatial Correlation Transformation for Speech Recognition

One-against-all weighted dynamic time warping for language-independent and speaker-dependent speech recognition in adverse conditions

Speaker Recognition with Random Digit Strings Using Uncertainty Normalized HMM-based i-vectors

An Approach To Robust Speaker Recognition Using Stochastic Matching

The influence of lexical characteristics and talker accent on the recognition of English words by speakers of Japanese

Speech-Section Extraction Using Lip Movement and Voice Information in Japanese

The speaking rate adaptation algorithm in Putonghua continuous speech recognition

Syllable based DNN-HMM Cantonese Speech to Text System

A New Method in Hidden Markov Model for Modeling Frame Correlation

Text-independent Speaker Recognition Based on Self-adaptation Compensation Transformation

Probabilistic Speaker-Class Based Acoustic Modeling for Large Vocabulary Continuous Speech Recognition