Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL) 2019
DOI: 10.18653/v1/k19-1032

Word Recognition, Competition, and Activation in a Model of Visually Grounded Speech

Abstract: In this paper, we study how word-like units are represented and activated in a recurrent neural model of visually grounded speech. The model used in our experiments is trained to project an image and its spoken description in a common representation space. We show that a recurrent model trained on spoken sentences implicitly segments its input into word-like units and reliably maps them to their correct visual referents. We introduce a methodology originating from linguistics to analyse the representation learned…

Cited by 20 publications (46 citation statements). References: 27 publications.
“…In terms of our more detailed linguistic analyses, the present findings largely align with the earlier literature on investigating linguistic units in VGS models (e.g., Chrupała et al., 2017; Alishahi et al., 2017; Havard et al., 2019a, 2019b; Merkx et al., 2019). However, the present study is the first one to show that broadly similar learning takes place in different model architectures (convolutional and recurrent)…”
Section: Discussion (supporting, confidence: 90%)
“…This is in line with the knowledge that infant early vocabulary tends to predominantly consist of concrete nouns. In another study, Havard et al. (2019b) examined the influence of different input data characteristics in a word recognition task by feeding the VGS model with synthesized isolated words with varying characteristics.…”
Section: Earlier Related Work (mentioning, confidence: 99%)
“…• Non-textual modality: Projection to a joint semantic space is used in spoken image captioning (Havard et al., 2019), bicoding for learning image attributes (Silberer and Lapata, 2014), representation learning of images (Zarrieß and Schlangen, 2017) and speech (Vijayakumar et al., 2017).…”
Section: Manipulating Representations (mentioning, confidence: 99%)
“…A rich body of work has recently emerged investigating representation learning for speech using visual grounding objectives (Synnaeve et al., 2014; Harwath and Glass, 2015; Harwath et al., 2016; Kamper et al., 2017; Havard et al., 2019a; Merkx et al., 2019; Scharenborg et al., 2018; Hsu and Glass, 2018a; Kamper et al., 2018; Ilharco et al., 2019; Eloff et al., 2019), as well as how word-like and subword-like linguistic units can be made to emerge within these models (Harwath and Glass, 2017; Drexler and Glass, 2017; Havard et al., 2019b; Harwath et al., 2020). So far, these efforts have predominantly focused on inference, where the goal is to learn a mapping from speech waveforms to a semantic embedding space.…”
Section: Introduction (mentioning, confidence: 99%)