2017
DOI: 10.1007/s10844-016-0438-z

An audio-visual corpus for multimodal automatic speech recognition

Abstract: A review of available audio-visual speech corpora and a description of a new multimodal corpus of English speech recordings is provided. The new corpus containing 31 hours of recordings was created specifically to assist audio-visual speech recognition systems (AVSR) development. The database related to the corpus includes high-resolution, high-framerate stereoscopic video streams from RGB cameras, depth imaging stream utilizing Time-of-Flight camera accompanied by audio recorded using both: a microphone array…

Citations: cited by 74 publications (35 citation statements)
References: 40 publications
“…Much of this growth is spurred on by the development of human-machine interaction systems, and the commercial interest in speech-enabled mobile devices. As one example, Czyzewski et al [57] released a corpus containing 31 h of recordings which includes high-resolution, high-framerate stereoscopic video, multi-channel audio, and manually annotated lexical transcripts. This dataset, and others, provide diverse materials to develop and test new multi-modal recurrence approaches.…”
Section: Modality (mentioning)
confidence: 99%
“…It is the largest audiovisual corpus of Polish speech (Igras M., Ziółko B., 2012; Jadczyk & Zi, 2015) as reported by Czyzewski et al (2017). The authors of this study evaluate the performance of a system built of acoustic and visual features and Dynamic Bayesian Network (DBN) models.…”
Section: Speech Corpus (mentioning)
confidence: 99%
“…This results in over 25 hours of recordings, consisting of a variety of speech scenarios, including text reading, issuing commands, telephonic speech, phonetically balanced 4.5 hours sub corpus recorded in an anechoic chamber, etc. (Czyzewski et al, 2017). (Tresadern, Ionita, & Cootes, 2011; van Ginneken, Frangi, Staal, ter Haar Romeny, & Viergever, 2002a) consists of a Point Distribution Model (PDM) aiming to learn the variations of valid shapes, and a set of flexible models capturing the grey-levels around a set of landmark feature points.…”
Section: Speech Corpus (mentioning)
confidence: 99%
“…A thorough overview of existing multimodal corpora and the challenges and limits involved in corpus building, can be found in [2] and [3]. As pointed out in [1], building a multimodal corpus requires to make decisions about several issues such as the number and gender of the participants, the modality of the recording (monologue from scripted text or free speech, dialogue), the number and characteristics of the recording devices (single camera, multicamera, microphones, motion capture systems, devices capable of capturing depth information, like Microsoft Kinect), the languages being used (single language, or multilingual), the signals to be captured (audio, facial expressions, hands and arms gestures, body posture), the words and sentences to be recorded in the case of scripted text monologues, etc.…”
Section: Introduction (mentioning)
confidence: 99%
“…The corpus contains 117,450 words, where 13,784 words are unique and about half of them appear only once. The Modality database [3] was designed specifically to assist audio-visual speech recognition systems development. This database includes high-resolution, high framerate stereoscopic video streams from RGB cameras, depth imaging stream utilizing Time-of-Flight camera accompanied by audio recorded using both, a microphone array and a microphone built in a mobile computer.…”
Section: Introduction (mentioning)
confidence: 99%