GLU 2017 International Workshop on Grounding Language Understanding
DOI: 10.21437/glu.2017-9
SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set

Abstract: This paper presents an augmentation of the MSCOCO dataset in which speech is added to image and text. Speech captions are generated with text-to-speech (TTS) synthesis, yielding 616,767 spoken captions (more than 600 h) paired with images. Disfluencies and speed perturbation are added to the signal so that it sounds more natural. Each speech signal (WAV) is paired with a JSON file containing the exact timecode of each word/syllable/phoneme in the spoken caption. Such a corpus could be used for Language and Vision (La…
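The abstract describes per-word/syllable/phoneme timecodes stored in a JSON file alongside each WAV. A minimal sketch of reading such an alignment might look like the following; the field names (`words`, `begin`, `end`) and the inline sample are assumptions for illustration, not the dataset's documented schema:

```python
import json

# Hypothetical alignment file for one spoken caption. The real SPEECH-COCO
# JSON schema may differ; this structure is assumed for illustration only.
alignment_json = """
{
  "caption": "a cat sits on a mat",
  "words": [
    {"word": "a",    "begin": 0.00, "end": 0.10},
    {"word": "cat",  "begin": 0.10, "end": 0.45},
    {"word": "sits", "begin": 0.45, "end": 0.80},
    {"word": "on",   "begin": 0.80, "end": 0.95},
    {"word": "a",    "begin": 0.95, "end": 1.05},
    {"word": "mat",  "begin": 1.05, "end": 1.40}
  ]
}
"""

def word_timecodes(doc):
    """Return (word, begin, end) tuples from an alignment JSON string."""
    data = json.loads(doc)
    return [(w["word"], w["begin"], w["end"]) for w in data["words"]]

# Print each word with its start/end time in seconds.
for word, begin, end in word_timecodes(alignment_json):
    print(f"{word:>5s}: {begin:.2f}-{end:.2f} s")
```

The same pattern would extend to syllable- or phoneme-level entries if the file nests them under analogous keys.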

Cited by 19 publications (17 citation statements)
References 15 publications
“…We also evaluate our method on MS COCO2017 dataset [49], which contains more than 200,000 pictures and…”
Section: ) Datasetmentioning
confidence: 99%
“…Each image is paired with at least five written captions describing the scene using the object categories. SPEECH-COCO (Havard et al., 2017) was derived from MSCOCO by using a speech synthesizer to create spoken captions for more than 600k of the image descriptions in the original MSCOCO dataset (Chen et al., 2015). The speech was generated using a commercial Voxygen text-to-speech (TTS) system, which is concatenative. … (Brent and Siskind, 2001) was used.…”
Section: Datamentioning
confidence: 99%
“…In addition to the FACC dataset, we use the SpeechCOCO dataset (Havard et al, 2017) to pretrain our models. SpeechCOCO contains over 600 hours of synthesised speech paired with images, as opposed to natural speech in the FACC dataset.…”
Section: Datasetmentioning
confidence: 99%