ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019
DOI: 10.1109/icassp.2019.8683069

Models of Visually Grounded Speech Signal Pay Attention to Nouns: A Bilingual Experiment on English and Japanese

Abstract: We investigate the behaviour of attention in neural models of visually grounded speech trained on two languages: English and Japanese. Experimental results show that attention focuses on nouns, and this behaviour holds true for two very typologically different languages. We also draw parallels between artificial neural attention and human attention and show that neural attention focuses on word endings as it has been theorised for human attention. Finally, we investigate how two visually grounded monolingual mo…

Cited by 24 publications (22 citation statements). References 20 publications.
“…In terms of our more detailed linguistic analyses, the present findings largely align with the earlier literature on investigating linguistic units in VGS models (e.g., Chrupała et al., 2017; Alishahi et al., 2017; Havard et al., 2019a, 2019b; Merkx et al., 2019). However, the present study is the first one to show that broadly similar learning takes place in different model architectures (convolutional and recurrent)…”
Section: Discussion (supporting)
confidence: 90%
“…They concluded that the presence of individual words in the input can be best predicted using activations of an intermediate (recurrent) layer of their model. Havard et al. (2019a) studied the neural attention mechanism (Bahdanau et al., 2015) in an RNN-based VGS model using English and Japanese speech data. They found that, similar to human attention (Gentner, 1982), neural attention mostly focuses on nouns and word endings.…”
Section: Earlier Related Work (mentioning)
confidence: 99%
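The attention mechanism cited in the statement above (Bahdanau et al., 2015) is, in these VGS models, typically used to pool the recurrent encoder's hidden states into a single utterance embedding, and it is the resulting per-frame weights that are inspected to see which words the model attends to. Below is a minimal PyTorch sketch of such additive attention pooling; the class name, dimensions, and hyperparameters are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Additive (Bahdanau-style) attention that pools a sequence of RNN
    hidden states into a single utterance embedding. Illustrative sketch,
    not the authors' exact implementation."""
    def __init__(self, hidden_dim: int, attn_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        # states: (batch, time, hidden_dim) -- outputs of a GRU/LSTM over speech frames
        energies = self.score(torch.tanh(self.proj(states)))   # (batch, time, 1)
        weights = torch.softmax(energies, dim=1)                # attention over time steps
        return (weights * states).sum(dim=1)                    # (batch, hidden_dim)

# Inspecting `weights` frame by frame is what makes it possible to ask
# which words (e.g. nouns, word endings) the model attends to.
```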
“…Kádár et al. (2017) introduced omission scores to interpret the contribution of individual tokens in text-based VGS models. More recently, Havard et al. (2019) studied the behaviour of attention in RNN-based VGS models and showed that these models tend to focus on nouns and could display language-specific patterns, such as focusing on particles when prompted with Japanese. Harwath et al. (2018) showed that CNN-based models could reliably map word-like units to their visual referents, and Harwath and Glass (2019) showed that such networks were sensitive to diphone transitions and that these were useful for the purpose of word recognition.…”
Section: Word Recognition In Humans (mentioning)
confidence: 99%
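The omission scores mentioned above (Kádár et al., 2017) quantify a token's contribution as the drop in cosine similarity between the embedding of the full sentence and the embedding of the sentence with that token removed. A minimal sketch, assuming a hypothetical `encode` callable that maps a token list to a 1-D embedding vector:

```python
import numpy as np

def omission_scores(tokens, encode):
    """omission(i) = 1 - cos(encode(tokens), encode(tokens without token i)).
    `encode` is a hypothetical callable returning a 1-D NumPy embedding."""
    full = encode(tokens)
    scores = []
    for i in range(len(tokens)):
        reduced = encode(tokens[:i] + tokens[i + 1:])
        cos = np.dot(full, reduced) / (np.linalg.norm(full) * np.linalg.norm(reduced))
        scores.append(1.0 - cos)
    return scores

# Tokens with high omission scores are those whose removal changes the
# utterance embedding the most, i.e. the tokens the model relies on.
```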
“…and more recently Merkx et al. (2019) showed that RNN-based utterance embeddings contain information about individual words, but did not show for what type of words this behaviour holds true or whether the model had learnt to map these individual words to their visual referents. Havard et al. (2019) showed that the attention mechanism of RNN-based VGS models tends to focus on the end of words that correspond to the main concept of the target image. This suggests that such models are able to isolate the target word forms from fluent speech and thus segment their inputs into sub-units.…”
Section: Model (mentioning)
confidence: 99%
“…Cross-lingual translation research has focused on text-to-text translation [21,22] as well as speech-to-text from one language to another [23,24,25]. [5] recently showed that joint image and speech training performs well on cross-lingual caption retrieval using English and Hindi, serving as a basis for speech-to-speech pseudo-translation, and [26] confirmed this result using an English-Japanese dataset. A similar line of work was presented in [27], which explored cross-lingual keyword spotting using a visual tagging system.…”
Section: Prior Work (mentioning)
confidence: 81%
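The joint image-and-speech training referred to in [5] and [26] is commonly implemented by projecting both modalities into a shared embedding space and optimising a margin-based ranking (triplet) objective over matched and mismatched pairs. The sketch below illustrates one common form of that loss; the margin value and the assumption of L2-normalised inputs are illustrative and not drawn from the cited papers.

```python
import torch

def triplet_ranking_loss(img_emb: torch.Tensor, spk_emb: torch.Tensor,
                         margin: float = 0.2) -> torch.Tensor:
    """Margin-based ranking loss over a batch of matched image/speech embeddings.
    Both inputs: (batch, dim), assumed L2-normalised. Illustrative sketch."""
    scores = img_emb @ spk_emb.t()                        # pairwise similarities, (batch, batch)
    pos = scores.diag().unsqueeze(1)                      # similarity of matched pairs
    cost_img = (margin + scores - pos).clamp(min=0)       # image as anchor vs. mismatched speech
    cost_spk = (margin + scores - pos.t()).clamp(min=0)   # speech as anchor vs. mismatched images
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_img = cost_img.masked_fill(mask, 0)              # ignore the matched (diagonal) pairs
    cost_spk = cost_spk.masked_fill(mask, 0)
    return cost_img.sum() + cost_spk.sum()
```

Minimising this objective pushes matched image/speech pairs closer together than mismatched ones by at least the margin, which is what enables the cross-lingual caption-retrieval evaluations described above.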