GLU 2017 International Workshop on Grounding Language Understanding
DOI: 10.21437/glu.2017-9
SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set

Abstract: This paper presents an augmentation of the MSCOCO dataset in which speech is added to image and text. Speech captions are generated with text-to-speech (TTS) synthesis, yielding 616,767 spoken captions (more than 600 h) paired with images. Disfluencies and speed perturbation are added to the signal so that it sounds more natural. Each speech signal (WAV) is paired with a JSON file containing the exact timecode of each word/syllable/phoneme in the spoken caption. Such a corpus could be used for Language and Vision (La…
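The abstract describes per-word/syllable/phoneme timecodes stored in a JSON file alongside each WAV. A minimal sketch of reading such an alignment might look like the following; the field names (`words`, `begin`, `end`) and the inline sample are assumptions for illustration, not the dataset's documented schema:

```python
import json

# Hypothetical alignment file for one spoken caption. The real SPEECH-COCO
# JSON schema may differ; this structure is assumed for illustration only.
alignment_json = """
{
  "caption": "a cat sits on a mat",
  "words": [
    {"word": "a",    "begin": 0.00, "end": 0.10},
    {"word": "cat",  "begin": 0.10, "end": 0.45},
    {"word": "sits", "begin": 0.45, "end": 0.80},
    {"word": "on",   "begin": 0.80, "end": 0.95},
    {"word": "a",    "begin": 0.95, "end": 1.05},
    {"word": "mat",  "begin": 1.05, "end": 1.40}
  ]
}
"""

def word_timecodes(doc):
    """Return (word, begin, end) tuples from an alignment JSON string."""
    data = json.loads(doc)
    return [(w["word"], w["begin"], w["end"]) for w in data["words"]]

# Print each word with its start/end time in seconds.
for word, begin, end in word_timecodes(alignment_json):
    print(f"{word:>5s}: {begin:.2f}-{end:.2f} s")
```

The same pattern would extend to syllable- or phoneme-level entries if the file nests them under analogous keys.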

Cited by 19 publications (17 citation statements)
References 15 publications
“…We also evaluate our method on MS COCO2017 dataset [49], which contains more than 200,000 pictures and…”
Section: ) Datasetmentioning
confidence: 99%
“…Each image is paired with at least five written captions describing the scene using the object categories. SPEECH-COCO (Havard et al., 2017) was derived from MSCOCO by using a speech synthesizer to create spoken captions for more than 600k of the image descriptions in the original MSCOCO dataset (Chen et al., 2015). The speech was generated using a commercial Voxygen text-to-speech (TTS) system, which is concatenative. … (Brent and Siskind, 2001) was used.…”
Section: Datamentioning
confidence: 99%
“…In addition to the FACC dataset, we use the SpeechCOCO dataset (Havard et al, 2017) to pretrain our models. SpeechCOCO contains over 600 hours of synthesised speech paired with images, as opposed to natural speech in the FACC dataset.…”
Section: Datasetmentioning
confidence: 99%