2006
DOI: 10.1121/1.2229005
An audio-visual corpus for speech perception and automatic speech recognition

Abstract: An audio-visual corpus has been collected to support the use of common material in speech perception and automatic speech recognition studies. The corpus consists of high-quality audio and video recordings of 1000 sentences spoken by each of 34 talkers. Sentences are simple, syntactically identical phrases such as "place green at B 4 now". Intelligibility tests using the audio signals suggest that the material is easily identifiable in quiet and low levels of stationary noise. The annotated corpus is available…

Citations: Cited by 953 publications (540 citation statements)
References: 9 publications (2 reference statements)
“…Speech files considered for the experiments are selected from the database presented in [20]. The database consists of speech files of 34 speakers.…”
Section: Experiments and Results (mentioning)
confidence: 99%
“…For the research in this paper, we used the Grid Corpus [8], an audiovisual dataset which contains 34 speakers, each reciting 1000 command sentences (e.g. "bin blue on red seven now").…”
Section: Grid Corpus (mentioning)
confidence: 99%
“…Rather than working on a linguistic basis, it purely considers the data on a frame-by-frame basis, and attempts to identify conditions that produce the best audiovisual mapping. A large multi-speaker dataset (the Grid corpus [8]) and different configurations of a non-linear neural network are used to identify optimal parameters and the best use of data for estimating an audio feature vector, given only visual information as input. This could arguably be considered to be a data driven, rather than a language driven, approach.…”
Section: Introduction (mentioning)
confidence: 99%
“…The clean utterances in the CHIME-2 data are taken from the GRID corpus (Cooke et al., 2006) which contains utterances from 34 speakers reading 6-word sequences of the form command-color-preposition-letter-digit-adverb. There are 25 different letters, 10 different digits and 4 different alternatives for each of the other classes.…”
Section: CHiME-2 (mentioning)
confidence: 99%
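
The six-slot grammar described in these excerpts is compact enough to sketch directly. The Python snippet below is a minimal illustration of that structure, assuming word lists that match the class sizes quoted above (4 commands, 4 colors, 4 prepositions, 25 letters, 10 digits, 4 adverbs); any specific words beyond those appearing in the quoted examples are assumptions, not taken from the corpus documentation.

import random

# Illustrative word lists for the six GRID sentence slots
# (command-color-preposition-letter-digit-adverb). Class sizes follow
# the cited description (4 alternatives for each non-letter/digit class,
# 25 letters, 10 digits); the specific words beyond those quoted in the
# excerpts above are assumptions.
COMMANDS = ["bin", "lay", "place", "set"]
COLORS = ["blue", "green", "red", "white"]
PREPOSITIONS = ["at", "by", "in", "with"]
LETTERS = list("abcdefghijklmnopqrstuvxyz")  # 25 letters (no 'w')
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
ADVERBS = ["again", "now", "please", "soon"]

def random_grid_sentence(rng=random):
    """Draw one command-color-preposition-letter-digit-adverb sentence."""
    return " ".join(rng.choice(slot) for slot in
                    (COMMANDS, COLORS, PREPOSITIONS, LETTERS, DIGITS, ADVERBS))

if __name__ == "__main__":
    print(random_grid_sentence())   # e.g. "place green at b four now"
    print(4 * 4 * 4 * 25 * 10 * 4)  # 64000 sentences permitted by the grammar

Under these class sizes the grammar admits 64,000 distinct sentences, of which each of the 34 talkers in the corpus records 1000.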