An Improvement in Audio-Visual Voice Activity Detection for Automatic Speech Recognition

Yoshida, Takami; Nakadai, Kazuhiro; Okuno, Hiroshi G.

doi:10.1007/978-3-642-13022-9_6

Cited by 7 publications (8 citation statements)

References 13 publications (12 reference statements)

“…Systemic hardware and software implementation details can be found in the references of the chapter and other related references contained therein.…”

Section: Discussion (mentioning)

confidence: 99%

Sociorobot World

Tzafestas

2016

Intelligent Systems, Control and Automation: Science and Engineering

“…Let be a Gaussian mixture PDF, given by: (26) where is the number of Gaussian components, are the mixture weights that sum to one, and is the PDF of the th Gaussian component, given by: (27) where is the dimension of and is the determinant of . We assume two such GMMs, one for the speech absence hypothesis, , and the other for the speech presence hypothesis, .…”

Section: A Unimodal Estimation Of Speech Presence Indicator (mentioning)

confidence: 99%
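
The symbols in the quoted passage above were lost in extraction, so the following is only a minimal sketch of the general idea, not the citing paper's implementation. It evaluates a Gaussian mixture density as a weighted sum of component densities and compares a speech-presence model with a speech-absence model. The diagonal covariances, the log-likelihood-ratio comparison, and every name and parameter value here are assumptions made for illustration.

```python
import math


def diag_gaussian_pdf(x, mean, var):
    """Density of a Gaussian with diagonal covariance (assumed) at point x."""
    d = len(x)
    det = math.prod(var)  # determinant of a diagonal covariance matrix
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return math.exp(-0.5 * quad) / math.sqrt((2 * math.pi) ** d * det)


def gmm_pdf(x, weights, means, variances):
    """Gaussian mixture density: weighted sum of the component densities."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to one")
    return sum(
        w * diag_gaussian_pdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    )


def log_likelihood_ratio(x, gmm_presence, gmm_absence):
    """log p(x | speech present) - log p(x | speech absent).

    Each GMM is a hypothetical (weights, means, variances) tuple.
    """
    return math.log(gmm_pdf(x, *gmm_presence)) - math.log(gmm_pdf(x, *gmm_absence))


# Hypothetical one-dimensional, single-component models for illustration only.
presence = ([1.0], [[2.0]], [[1.0]])
absence = ([1.0], [[0.0]], [[1.0]])
llr = log_likelihood_ratio([2.0], presence, absence)
```

A positive ratio favours the speech-presence model for that frame; how the ratio is turned into a decision is not stated in the quoted passage.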

“…The modalities are fused in the features level using a weighted sum and the combined audio-visual feature is compared to a threshold for the classification. Another approach for AV-VAD which is also designed for incorporation in an SRS was presented in [26]. The audio signal is represented by a feature based on a likelihood score for silence which is evaluated in the SRS based on recognition scores, and the video features are based on the width and the height of the lips.…”

Section: Introduction (mentioning)

confidence: 99%
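
The weighted-sum fusion described in the first sentence of the quoted passage above can be sketched as follows. This is a minimal illustration under stated assumptions: scalar per-frame features, a single weight `alpha`, and a fixed `threshold` are all hypothetical choices, and none of the values come from the quoted work.

```python
def fuse_features(audio_feat, video_feat, alpha=0.5):
    """Feature-level fusion as a weighted sum of two scalar features.

    alpha is a hypothetical weight in [0, 1] given to the audio feature.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * audio_feat + (1.0 - alpha) * video_feat


def is_speech(audio_feat, video_feat, alpha=0.5, threshold=0.5):
    """Classify a frame as speech when the fused feature exceeds the threshold."""
    return fuse_features(audio_feat, video_feat, alpha) > threshold


# Hypothetical per-frame (audio, video) feature values.
frames = [(0.9, 0.2), (0.2, 0.2), (0.4, 0.8)]
decisions = [is_speech(a, v) for a, v in frames]
```

In practice the weight and threshold would be tuned on labeled data; the quoted passage does not say how.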

Audio-Visual Voice Activity Detection Using Diffusion Maps

Dov, Talmon, Cohen

2015

IEEE/ACM Trans. Audio Speech Lang. Process.

The performance of traditional voice activity detectors significantly deteriorates in the presence of highly nonstationary noise and transient interferences. One solution is to incorporate a video signal which is invariant to the acoustic environment. Although several voice activity detectors based on the video signal were recently presented, merely few detectors which are based on both the audio and the video signals exist in the literature to date. In this paper, we present an audio-visual voice activity detector and show that the incorporation of both audio and video signals is highly beneficial for voice activity detection. The algorithm is based on a supervised learning procedure, and a labeled training data set is considered. The algorithm comprises a feature extraction procedure, where the features are designed to separate speech from nonspeech frames. Diffusion maps is applied separately and similarly to the features of each modality and builds a low dimensional representation. Using the new representation, we propose a measure for voice activity which is based on a supervised learning procedure and the variability between adjacent frames in time. The measures of the two modalities are merged to provide voice activity detection based on both the audio and the video signals. Experimental results demonstrate the improved performance of the proposed algorithm compared to state-of-the-art detectors.

Index Terms: Audio-visual speech processing, diffusion maps, voice activity detection.

“…To solve the issues in AV-VAD, we introduced AV-VAD based on Bayesian network [13], because Bayesian network provides a framework that integrates multiple features with some ambiguities by maximizing the likelihood of the total integrated system. Actually, we used the following features as the inputs of the Bayesian network:…”

Section: A. Audio-visual Integration for VAD (mentioning)

confidence: 99%

“…This feature reported high noise-robustness [14]. The second feature is derived from the temporal sequence of the height and width information by using linear regression [13]. The last feature is calculated in the face detection process.…”

Section: A. Audio-visual Integration for VAD (mentioning)

confidence: 99%
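
One way to read "derived from the temporal sequence of the height and width information by using linear regression" is to fit a least-squares line to a short window of lip measurements and use its slope as a dynamic feature. The sketch below follows that reading only as an assumption; the window contents, the choice of slope as the feature, and all function names are hypothetical and are not taken from the quoted work.

```python
def regression_slope(values):
    """Least-squares slope of values against frame indices 0, 1, ..., n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two frames to fit a line")
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    num = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den


def lip_dynamics(heights, widths):
    """Return (height slope, width slope) for one window of lip measurements."""
    return regression_slope(heights), regression_slope(widths)


# Hypothetical window of lip heights and widths (in pixels) over four frames.
height_slope, width_slope = lip_dynamics([1.0, 3.0, 5.0, 7.0], [4.0, 4.0, 4.0, 4.0])
```

A slope near zero in both dimensions suggests still lips, while a larger magnitude suggests mouth movement within the window.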

Two-layered audio-visual speech recognition for robots in noisy environments

Yoshida, Nakadai, Okuno

2010

2010 IEEE/RSJ International Conference on Intelligent Robots and Systems

Audio-visual (AV) integration is one of the key ideas to improve perception in noisy real-world environments. This paper describes automatic speech recognition (ASR) to improve human-robot interaction based on AV integration. We developed AV-integrated ASR, which has two AV integration layers, that is, voice activity detection (VAD) and ASR. However, the system has three difficulties: 1) VAD and ASR have been separately studied although these processes are mutually dependent, 2) VAD and ASR assumed that high resolution images are available although this assumption never holds in the real world, and 3) an optimal weight between audio and visual stream was fixed while their reliabilities change according to environmental changes. To solve these problems, we propose a new VAD algorithm taking ASR characteristics into account, and a linear-regression-based optimal weight estimation method. We evaluate the algorithm for auditory- and/or visually-contaminated data. Preliminary results show that the robustness of VAD improved even when the resolution of the images is low, and the AVSR using estimated stream weight shows the effectiveness of AV integration.
