2010
DOI: 10.1007/978-3-642-13022-9_6

An Improvement in Audio-Visual Voice Activity Detection for Automatic Speech Recognition

Abstract: Noise-robust Automatic Speech Recognition (ASR) is essential for robots that are expected to communicate with humans in everyday environments. In such environments, Voice Activity Detection (VAD) strongly affects ASR performance because there is a great deal of acoustic and visual noise. In this paper, we improve Audio-Visual VAD for our two-layered audio-visual integration framework for ASR by using hangover processing based on erosion and dilation. We implemented the proposed method in our audio-…
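The abstract names hangover processing based on erosion and dilation but gives no details here. The following is a minimal sketch of how such smoothing is commonly applied to per-frame VAD decisions, not the authors' implementation. The function names, the symmetric window `radius`, and the order of operations (fill short gaps first, then drop short bursts) are all assumptions made for illustration.

```python
def dilate(frames, radius):
    """Mark a frame as speech if any frame within `radius` of it is speech."""
    n = len(frames)
    return [int(any(frames[max(0, i - radius):i + radius + 1])) for i in range(n)]


def erode(frames, radius):
    """Keep a frame as speech only if every frame within `radius` is speech.

    Positions outside the sequence are treated as non-speech (an assumption).
    """
    n = len(frames)
    return [
        int(all(0 <= j < n and frames[j] for j in range(i - radius, i + radius + 1)))
        for i in range(n)
    ]


def smooth_vad(frames, radius):
    """Hypothetical hangover scheme: closing, then opening."""
    # Closing (dilate, then erode) fills silent gaps of up to 2 * radius frames.
    closed = erode(dilate(frames, radius), radius)
    # Opening (erode, then dilate) removes speech bursts of up to 2 * radius frames.
    return dilate(erode(closed, radius), radius)


if __name__ == "__main__":
    raw = [0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0]
    print(smooth_vad(raw, 1))
```

In this sketch the one-frame gap inside the first speech segment is filled, and the isolated one-frame detection near the end is discarded. A larger `radius` tolerates longer pauses at the cost of coarser segment boundaries.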

Cited by 7 publications (8 citation statements)
References 13 publications (12 reference statements)
“…Systemic hardware and software implementation details can be found in the references of the chapter and other related references contained therein.…”
Section: Discussion (mentioning)
confidence: 99%
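“…Let $f(\mathbf{x})$ be a Gaussian mixture PDF, given by:

$$f(\mathbf{x}) = \sum_{k=1}^{K} w_k \, g_k(\mathbf{x}) \qquad (26)$$

where $K$ is the number of Gaussian components, $w_k$ are the mixture weights that sum to one, and $g_k(\mathbf{x})$ is the PDF of the $k$th Gaussian component, given by:

$$g_k(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} \, |\boldsymbol{\Sigma}_k|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^{T} \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) \right) \qquad (27)$$

where $d$ is the dimension of $\mathbf{x}$ and $|\boldsymbol{\Sigma}_k|$ is the determinant of $\boldsymbol{\Sigma}_k$. We assume two such GMMs, one for the speech absence hypothesis, $H_0$, and the other for the speech presence hypothesis, $H_1$.…”
Section: A Unimodal Estimation Of Speech Presence Indicator (mentioning)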
confidence: 99%
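To make the two-GMM setup in the statement above concrete, here is a small sketch that scores a feature vector under a speech-absence mixture and a speech-presence mixture and returns the log-likelihood ratio. It is an illustration under stated assumptions, not the cited paper's estimator: diagonal covariances are assumed for brevity, and all function names and the `(weights, means, variances)` tuple layout are hypothetical.

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumed here)."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)  # log-determinant of a diagonal matrix
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + maha)


def gmm_logpdf(x, weights, means, variances):
    """Log of the weighted sum of component densities, via log-sum-exp."""
    terms = [
        math.log(w) + diag_gaussian_logpdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def speech_log_likelihood_ratio(x, absence_gmm, presence_gmm):
    """Positive values favour speech presence, negative values speech absence."""
    return gmm_logpdf(x, *presence_gmm) - gmm_logpdf(x, *absence_gmm)


if __name__ == "__main__":
    # Toy one-dimensional, single-component models (illustrative values only).
    absence = ([1.0], [[0.0]], [[1.0]])
    presence = ([1.0], [[2.0]], [[1.0]])
    print(speech_log_likelihood_ratio([2.0], absence, presence))
```

Working in the log domain avoids underflow when the feature dimension is large. Comparing the ratio against a threshold is one common way to turn it into a frame-level decision.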
“…The modalities are fused in the features level using a weighted sum and the combined audio-visual feature is compared to a threshold for the classification. Another approach for AV-VAD which is also designed for incorporation in an SRS was presented in [26]. The audio signal is represented by a feature based on a likelihood score for silence which is evaluated in the SRS based on recognition scores, and the video features are based on the width and the height of the lips.…”
Section: Introduction (mentioning)
confidence: 99%
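The feature-level fusion described in the statement above (a weighted sum of the audio and visual features compared with a threshold) can be sketched in a few lines. The statement does not give the weights or the threshold, so the convex weighting with a single `audio_weight` and every numeric value below are assumptions for illustration only.

```python
def fuse_and_classify(audio_feature, visual_feature, audio_weight, threshold):
    """Weighted-sum fusion of two scalar features followed by thresholding.

    `audio_weight` in [0, 1] is a hypothetical parameter; the visual feature
    receives the remaining weight so that the two weights sum to one.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    fused = audio_weight * audio_feature + (1.0 - audio_weight) * visual_feature
    return fused, fused > threshold


if __name__ == "__main__":
    fused, is_speech = fuse_and_classify(0.8, 0.2, audio_weight=0.5, threshold=0.4)
    print(fused, is_speech)
```

Lowering `audio_weight` lets the visual stream dominate, which may help when the acoustic channel is noisy. Both features would need to be on comparable scales for the weighting to be meaningful.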
“…To solve the issues in AV-VAD, we introduced AV-VAD based on Bayesian network [13], because Bayesian network provides a framework that integrates multiple features with some ambiguities by maximizing the likelihood of the total integrated system. Actually, we used the following features as the inputs of the Bayesian network:…”
Section: A Audio-visual Integration For VAD (mentioning)
confidence: 99%
“…This feature reported high noise-robustness [14]. The second feature is derived from the temporal sequence of the height and width information by using linear regression [13]. The last feature is calculated in the face detection process.…”
Section: A Audio-visual Integration For VAD (mentioning)
confidence: 99%
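The statement above says only that one visual feature is derived from the temporal sequence of lip height and width by linear regression. One plausible reading, shown here as a sketch rather than the cited method, is to fit a least-squares line to a short window of measurements and use its slope as a motion feature. The function name, the use of frame indices as the time axis, and the sample values are assumptions.

```python
def regression_slope(values):
    """Least-squares slope of `values` against frame indices 0, 1, ..., n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two frames to fit a line")
    t_mean = (n - 1) / 2.0
    v_mean = sum(values) / n
    numerator = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    denominator = sum((t - t_mean) ** 2 for t in range(n))
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical lip-height measurements (in pixels) over four video frames.
    lip_height = [10.0, 12.0, 14.0, 16.0]
    print(regression_slope(lip_height))
```

A slope near zero would suggest a still mouth, while a large magnitude would suggest opening or closing. The same function could be applied to the width sequence to obtain a second value.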