2015
DOI: 10.1007/s00530-015-0499-9
Audio-visual speech recognition integrating 3D lip information obtained from the Kinect

Cited by 21 publications (10 citation statements). References 12 publications.
“…Also called a match pair, in which both samples in a pair belong to the same identity; also called a non-match pair, in which the samples in a pair belong to different identities…”
mentioning
confidence: 99%
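The match/non-match pair terminology above can be made concrete with a short sketch. Assuming labeled samples are available as (identity, feature) tuples (the function and variable names here are illustrative, not from the paper), pair generation is just an identity comparison over all sample combinations:

```python
from itertools import combinations

def make_pairs(samples):
    """Build match (genuine) and non-match (impostor) pairs.

    samples: list of (identity, feature) tuples.
    A match pair holds two samples of the same identity;
    a non-match pair holds samples of different identities.
    """
    match, non_match = [], []
    for (id_a, feat_a), (id_b, feat_b) in combinations(samples, 2):
        if id_a == id_b:
            match.append((feat_a, feat_b))
        else:
            non_match.append((feat_a, feat_b))
    return match, non_match

data = [("alice", 0.1), ("alice", 0.2), ("bob", 0.9)]
m, n = make_pairs(data)
# m -> [(0.1, 0.2)]
# n -> [(0.1, 0.9), (0.2, 0.9)]
```

Match pairs drive the genuine-score distribution and non-match pairs the impostor-score distribution in a verification evaluation.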
“…Palecek [9] proposed depth-based active appearance model (AAM) features and improved the accuracy over DCT. Wang et al. [10] used features based on 3D lip points obtained from the Kinect. These methods are more suitable for real applications.…”
Section: Related Work
mentioning
confidence: 99%
“…Recently, depth cameras such as the Microsoft Kinect have become available at low cost. They have also been used for multimodal speech recognition [8], [9], [10]. In this study, we aim to improve the performance of multimodal speech recognition using depth cameras.…”
Section: Introduction
mentioning
confidence: 99%
“…In recent years, various AVSR modeling techniques [4,5,6,7,8,9,10] have been developed and yielded an impressive improvement over the ASR systems using only audio in an adverse environment. Conventional AVSR systems based on these approaches require highly specialized audio-visual (AV) data in both system training and evaluation.…”
Section: Introduction
mentioning
confidence: 99%
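The excerpt above describes AVSR systems that combine audio and visual streams to outperform audio-only ASR in noise. A common combination strategy is weighted late (decision) fusion of per-stream log-likelihoods; the sketch below is a generic illustration of that idea, with the weight `alpha` as an assumed tuning parameter rather than anything specified in the cited work:

```python
def late_fusion_score(audio_loglik, visual_loglik, alpha=0.7):
    """Weighted log-likelihood combination (late fusion) for AVSR.

    alpha weights the audio stream, (1 - alpha) the visual stream;
    in practice alpha is lowered as acoustic noise increases.
    """
    return alpha * audio_loglik + (1 - alpha) * visual_loglik

score = late_fusion_score(-10.0, -20.0, alpha=0.5)
# equal weighting -> -15.0
```

The recognizer would pick the word hypothesis maximizing this fused score across both streams.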