2006
DOI: 10.1121/1.2229005
An audio-visual corpus for speech perception and automatic speech recognition

Abstract: An audio-visual corpus has been collected to support the use of common material in speech perception and automatic speech recognition studies. The corpus consists of high-quality audio and video recordings of 1000 sentences spoken by each of 34 talkers. Sentences are simple, syntactically identical phrases such as "place green at B 4 now". Intelligibility tests using the audio signals suggest that the material is easily identifiable in quiet and low levels of stationary noise. The annotated corpus is available…

Citations: Cited by 953 publications (540 citation statements)
References: 9 publications (2 reference statements)
“…Speech files considered for the experiments are selected from the database presented in [20]. The database consists of speech files of 34 speakers.…”
Section: Experiments and Results (mentioning)
confidence: 99%
“…For the research in this paper, we used the Grid Corpus [8], an audiovisual dataset which contains 34 speakers, each reciting 1000 command sentences (e.g. "bin blue on red seven now").…”
Section: Grid Corpus (mentioning)
confidence: 99%
“…Rather than working on a linguistic basis, it purely considers the data on a frame-by-frame basis, and attempts to identify conditions that produce the best audiovisual mapping. A large multi-speaker dataset (the Grid corpus [8]) and different configurations of a non-linear neural network are used to identify optimal parameters and the best use of data for estimating an audio feature vector, given only visual information as input. This could arguably be considered to be a data driven, rather than a language driven, approach.…”
Section: Introduction (mentioning)
confidence: 99%
“…The clean utterances in the CHIME-2 data are taken from the GRID corpus (Cooke et al., 2006) which contains utterances from 34 speakers reading 6-word sequences of the form command-color-preposition-letter-digit-adverb. There are 25 different letters, 10 different digits and 4 different alternatives for each of the other classes.…”
Section: CHiME-2 (mentioning)
confidence: 99%
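
The six-slot grammar described in these excerpts is compact enough to sketch directly. The Python snippet below is a minimal illustration of that structure, assuming word lists that match the class sizes quoted above (4 commands, 4 colors, 4 prepositions, 25 letters, 10 digits, 4 adverbs); any specific words beyond those appearing in the quoted examples are assumptions, not taken from the corpus documentation.

import random

# Illustrative word lists for the six GRID sentence slots
# (command-color-preposition-letter-digit-adverb). Class sizes follow
# the cited description (4 alternatives for each non-letter/digit class,
# 25 letters, 10 digits); the specific words beyond those quoted in the
# excerpts above are assumptions.
COMMANDS = ["bin", "lay", "place", "set"]
COLORS = ["blue", "green", "red", "white"]
PREPOSITIONS = ["at", "by", "in", "with"]
LETTERS = list("abcdefghijklmnopqrstuvxyz")  # 25 letters (no 'w')
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
ADVERBS = ["again", "now", "please", "soon"]

def random_grid_sentence(rng=random):
    """Draw one command-color-preposition-letter-digit-adverb sentence."""
    return " ".join(rng.choice(slot) for slot in
                    (COMMANDS, COLORS, PREPOSITIONS, LETTERS, DIGITS, ADVERBS))

if __name__ == "__main__":
    print(random_grid_sentence())   # e.g. "place green at b four now"
    print(4 * 4 * 4 * 25 * 10 * 4)  # 64000 sentences permitted by the grammar

Under these class sizes the grammar admits 64,000 distinct sentences, of which each of the 34 talkers in the corpus records 1000.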