Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL) 2019
DOI: 10.18653/v1/k19-1032

Word Recognition, Competition, and Activation in a Model of Visually Grounded Speech

Abstract: In this paper, we study how word-like units are represented and activated in a recurrent neural model of visually grounded speech. The model used in our experiments is trained to project an image and its spoken description in a common representation space. We show that a recurrent model trained on spoken sentences implicitly segments its input into word-like units and reliably maps them to their correct visual referents. We introduce a methodology originating from linguistics to analyse the representation learned…

Cited by 20 publications (46 citation statements). References: 27 publications.
“…In terms of our more detailed linguistic analyses, the present findings largely align with the earlier literature on investigating linguistic units in VGS models (e.g., Chrupała et al., 2017; Alishahi et al., 2017; Havard et al., 2019a, 2019b; Merkx et al., 2019). However, the present study is the first one to show that broadly similar learning takes place in different model architectures (convolutional and recurrent)…”
Section: Discussion (supporting, confidence: 90%)
“…This is in line with the knowledge that infant early vocabulary tends to predominantly consist of concrete nouns. In another study, Havard et al. (2019b) examined the influence of different input data characteristics in a word recognition task by feeding the VGS model with synthesized isolated words with varying characteristics.…”
Section: Earlier Related Work (mentioning, confidence: 99%)
“…• Non-textual modality: Projection to a joint semantic space is used in spoken image captioning (Havard et al., 2019), bicoding for learning image attributes (Silberer and Lapata, 2014), representation learning of images (Zarrieß and Schlangen, 2017) and speech (Vijayakumar et al., 2017).…”
Section: Manipulating Representations (mentioning, confidence: 99%)
“…A rich body of work has recently emerged investigating representation learning for speech using visual grounding objectives (Synnaeve et al., 2014; Harwath and Glass, 2015; Harwath et al., 2016; Kamper et al., 2017; Havard et al., 2019a; Merkx et al., 2019; Scharenborg et al., 2018; Hsu and Glass, 2018a; Kamper et al., 2018; Ilharco et al., 2019; Eloff et al., 2019), as well as how word-like and subword-like linguistic units can be made to emerge within these models (Harwath and Glass, 2017; Drexler and Glass, 2017; Havard et al., 2019b; Harwath et al., 2020). So far, these efforts have predominantly focused on inference, where the goal is to learn a mapping from speech waveforms to a semantic embedding space.…”
Section: Introduction (mentioning, confidence: 99%)