2020
DOI: 10.48550/arxiv.2011.09804
Preprint
TaL: a synchronised multi-speaker corpus of ultrasound tongue imaging, audio, and lip videos

Abstract: We present the Tongue and Lips corpus (TaL), a multi-speaker corpus of audio, ultrasound tongue imaging, and lip videos. TaL consists of two parts: TaL1 is a set of six recording sessions of one professional voice talent, a male native speaker of English; TaL80 is a set of recording sessions of 81 native speakers of English without voice talent experience. Overall, the corpus contains 24 hours of parallel ultrasound, video, and audio data, of which approximately 13.5 hours are speech. This paper describes the …

Cited by 3 publications
(15 citation statements)
References 35 publications
(45 reference statements)
“…Our best MCD score of 3.08 corresponds to a low-quality but intelligible speech [4]. In comparison, Ribeiro et al obtained an MCD score of 2.99 on the same corpus using more sophisticated encoder-decoder networks [20].…”
Section: The Impact of VAD on the SSI
confidence: 67%
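The MCD scores quoted in this statement (3.08 vs. 2.99) refer to Mel-Cepstral Distortion, the standard dB-scale objective measure for comparing synthesized and reference speech. A minimal sketch of the usual formula follows; the coefficient count and the convention of dropping the 0th (energy) coefficient are common practice, not details taken from the cited papers:

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_est):
    """Frame-averaged Mel-Cepstral Distortion (MCD) in dB.

    mcep_ref, mcep_est: (n_frames, n_coeffs) arrays of mel-cepstral
    coefficients, assumed already time-aligned. The 0th coefficient
    (frame energy) is conventionally excluded from the distance.
    """
    diff = mcep_ref[:, 1:] - mcep_est[:, 1:]
    # Per-frame distance: (10 / ln 10) * sqrt(2 * sum of squared diffs)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

# Identical sequences give zero distortion; any mismatch gives a positive dB value.
ref = np.random.default_rng(0).normal(size=(5, 13))
print(mel_cepstral_distortion(ref, ref))  # → 0.0
```

Lower is better, which is why the 2.99 obtained with encoder-decoder networks is reported as an improvement over 3.08.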
“…For the experiments we used the English TAL corpus [20]. It contains parallel ultrasound, speech and lip video recordings from 81 native English speakers, and we used just the TaL1 subset which contains recordings from one male native speaker.…”
Section: The Ultrasound Dataset
confidence: 99%
“…As Fig 1 shows, the input to our system is a sequence of ultrasound tongue imaging (UTI) frames, and the target sequence is a speech signal. This is a sequence-to-sequence mapping problem, which could be addressed by sophisticated encoder-decoder networks that would not even require aligned training data [25]. However, as we have synchronized input-output samples, most authors apply simpler networks that perform the mapping frame by frame [28,5].…”
Section: The SSI Framework
confidence: 99%
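The frame-by-frame mapping this statement describes — exploiting the synchronized input-output samples so that each ultrasound frame is regressed onto one speech-feature frame, with no sequence model — can be sketched as follows. All dimensions and data here are placeholders, and a linear least-squares map stands in for the DNN regressors used in the cited work:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical dimensions: each ultrasound frame flattened to 64x128 pixels,
# each target frame a 25-dim vector of spectral/vocoder parameters.
n_frames, uti_dim, spec_dim = 200, 64 * 128, 25

uti = rng.normal(size=(n_frames, uti_dim))    # synchronized UTI frames (input)
spec = rng.normal(size=(n_frames, spec_dim))  # parallel speech features (target)

# Because the two streams are already aligned frame-for-frame, the mapping
# can be learned per frame: one input frame -> one output frame.
W, *_ = np.linalg.lstsq(uti, spec, rcond=None)
pred = uti @ W  # predicted speech features, one row per ultrasound frame
```

An aligned-data setup like this is why the statement contrasts simpler frame-wise networks with encoder-decoder models, which are only needed when the training data is unaligned.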
“…In the experiments we used the TaL80 corpus [25], which contains ultrasound, speech and lip video recordings from 81 speakers. Apart from the silent speech experiments, the speech signals were also recorded in parallel with the ultrasound, and here we used these synchronized ultrasound and speech tracks.…”
Section: Experimental Set-up
confidence: 99%