ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019
DOI: 10.1109/icassp.2019.8683069

Models of Visually Grounded Speech Signal Pay Attention to Nouns: A Bilingual Experiment on English and Japanese

Abstract: We investigate the behaviour of attention in neural models of visually grounded speech trained on two languages: English and Japanese. Experimental results show that attention focuses on nouns, and this behaviour holds true for two very typologically different languages. We also draw parallels between artificial neural attention and human attention and show that neural attention focuses on word endings as it has been theorised for human attention. Finally, we investigate how two visually grounded monolingual mo…

Cited by 24 publications (22 citation statements). References 20 publications.
“…In terms of our more detailed linguistic analyses, the present findings largely align with the earlier literature on investigating linguistic units in VGS models (e.g., Chrupała et al., 2017; Alishahi et al., 2017; Havard et al., 2019a, 2019b; Merkx et al., 2019). However, the present study is the first one to show that broadly similar learning takes place in different model architectures (convolutional and recurrent)…”
Section: Discussion (supporting)
confidence: 90%
“…They concluded that the presence of individual words in the input can be best predicted using activations of an intermediate (recurrent) layer of their model. Havard et al. (2019a) studied the neural attention mechanism (Bahdanau et al., 2015) in an RNN-based VGS model using English and Japanese speech data. They found that, similar to human attention (Gentner, 1982), neural attention mostly focuses on nouns and word endings.…”
Section: Earlier Related Work (mentioning)
confidence: 99%
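The attention mechanism cited in the statement above (Bahdanau et al., 2015) is, in these VGS models, typically used to pool the recurrent encoder's hidden states into a single utterance embedding, and it is the resulting per-frame weights that are inspected to see which words the model attends to. Below is a minimal PyTorch sketch of such additive attention pooling; the class name, dimensions, and hyperparameters are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Additive (Bahdanau-style) attention that pools a sequence of RNN
    hidden states into a single utterance embedding. Illustrative sketch,
    not the authors' exact implementation."""
    def __init__(self, hidden_dim: int, attn_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        # states: (batch, time, hidden_dim) -- outputs of a GRU/LSTM over speech frames
        energies = self.score(torch.tanh(self.proj(states)))   # (batch, time, 1)
        weights = torch.softmax(energies, dim=1)                # attention over time steps
        return (weights * states).sum(dim=1)                    # (batch, hidden_dim)

# Inspecting `weights` frame by frame is what makes it possible to ask
# which words (e.g. nouns, word endings) the model attends to.
```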
“…Kádár et al. (2017) introduced omission scores to interpret the contribution of individual tokens in text-based VGS models. More recently, Havard et al. (2019) studied the behaviour of attention in RNN-based VGS models and showed that these models tend to focus on nouns and could display language-specific patterns, such as focusing on particles when prompted with Japanese. Harwath et al. (2018) showed that CNN-based models could reliably map word-like units to their visual referents, and Harwath and Glass (2019) showed that such networks were sensitive to diphone transitions and that these were useful for the purpose of word recognition.…”
Section: Word Recognition In Humans (mentioning)
confidence: 99%
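The omission scores mentioned above (Kádár et al., 2017) quantify a token's contribution as the drop in cosine similarity between the embedding of the full sentence and the embedding of the sentence with that token removed. A minimal sketch, assuming a hypothetical `encode` callable that maps a token list to a 1-D embedding vector:

```python
import numpy as np

def omission_scores(tokens, encode):
    """omission(i) = 1 - cos(encode(tokens), encode(tokens without token i)).
    `encode` is a hypothetical callable returning a 1-D NumPy embedding."""
    full = encode(tokens)
    scores = []
    for i in range(len(tokens)):
        reduced = encode(tokens[:i] + tokens[i + 1:])
        cos = np.dot(full, reduced) / (np.linalg.norm(full) * np.linalg.norm(reduced))
        scores.append(1.0 - cos)
    return scores

# Tokens with high omission scores are those whose removal changes the
# utterance embedding the most, i.e. the tokens the model relies on.
```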
“…and more recently Merkx et al. (2019) showed that RNN-based utterance embeddings contain information about individual words, but did not show for what type of words this behaviour holds true or whether the model had learnt to map these individual words to their visual referents. Havard et al. (2019) showed that the attention mechanism of RNN-based VGS models tends to focus on the end of words that correspond to the main concept of the target image. This suggests that such models are able to isolate the target word forms from fluent speech and thus segment their inputs into sub-units.…”
Section: Model (mentioning)
confidence: 99%
“…Cross-lingual translation research has focused on text-to-text translation [21,22] as well as speech-to-text from one language to another [23,24,25]. [5] recently showed that joint image and speech training performs well on cross-lingual caption retrieval using English and Hindi, serving as a basis for speech-to-speech pseudo-translation, and [26] confirmed this result using an English-Japanese dataset. A similar line of work was presented in [27], which explored cross-lingual keyword spotting using a visual tagging system.…”
Section: Prior Work (mentioning)
confidence: 81%
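The joint image-and-speech training referred to in [5] and [26] is commonly implemented by projecting both modalities into a shared embedding space and optimising a margin-based ranking (triplet) objective over matched and mismatched pairs. The sketch below illustrates one common form of that loss; the margin value and the assumption of L2-normalised inputs are illustrative and not drawn from the cited papers.

```python
import torch

def triplet_ranking_loss(img_emb: torch.Tensor, spk_emb: torch.Tensor,
                         margin: float = 0.2) -> torch.Tensor:
    """Margin-based ranking loss over a batch of matched image/speech embeddings.
    Both inputs: (batch, dim), assumed L2-normalised. Illustrative sketch."""
    scores = img_emb @ spk_emb.t()                        # pairwise similarities, (batch, batch)
    pos = scores.diag().unsqueeze(1)                      # similarity of matched pairs
    cost_img = (margin + scores - pos).clamp(min=0)       # image as anchor vs. mismatched speech
    cost_spk = (margin + scores - pos.t()).clamp(min=0)   # speech as anchor vs. mismatched images
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_img = cost_img.masked_fill(mask, 0)              # ignore the matched (diagonal) pairs
    cost_spk = cost_spk.masked_fill(mask, 0)
    return cost_img.sum() + cost_spk.sum()
```

Minimising this objective pushes matched image/speech pairs closer together than mismatched ones by at least the margin, which is what enables the cross-lingual caption-retrieval evaluations described above.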