ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020
DOI: 10.1109/icassp40776.2020.9053428

Trilingual Semantic Embeddings of Visually Grounded Speech with Self-Attention Mechanisms

Citations: cited by 19 publications (13 citation statements)
References: 21 publications
“…Images and Spoken Captions. We fine-tune and evaluate our model on the Places Audio Caption dataset [4], which contains 100k images from the Places205 dataset [22] each with a spoken caption in Japanese [11] and Hindi [9]. We evaluate the performance on audio to image and image to audio retrieval using the standard recall metrics R@1, R@5, R@10.…”
Section: Methods (mentioning)
confidence: 99%
“…We evaluate the performance on audio to image and image to audio retrieval using the standard recall metrics R@1, R@5, R@10. We follow the prior work [9,11] and report results on the validation sets of 1k images and spoken captions.…”
Section: Methods (mentioning)
confidence: 99%
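The recall metrics named in the statement above can be illustrated with a minimal sketch. It assumes a square similarity matrix in which query i (a spoken caption) has exactly one correct candidate, image i; the function names `recall_at_k` and `transpose` are hypothetical and are not taken from the cited papers.

```python
def recall_at_k(similarity, ks=(1, 5, 10)):
    """Fraction of queries whose correct match ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        true_score = row[i]
        # Rank = number of candidates scoring strictly higher than the true match.
        ranks.append(sum(1 for score in row if score > true_score))
    n = len(ranks)
    return {k: sum(1 for r in ranks if r < k) / n for k in ks}


def transpose(matrix):
    """Swap query and candidate roles (audio-to-image becomes image-to-audio)."""
    return [list(col) for col in zip(*matrix)]


# Toy 3x3 example: rows are spoken captions, columns are images.
toy = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.1, 0.2, 0.7],
]
audio_to_image = recall_at_k(toy, ks=(1, 2, 3))
image_to_audio = recall_at_k(transpose(toy), ks=(1, 2, 3))
```

On a 1k-pair validation set, the same function would be called with a 1000 x 1000 matrix and the default `ks=(1, 5, 10)`, once per retrieval direction.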
“…Whether CNN-based approaches or RNN-based approaches are employed, all seem to segment individual words from the inputted spoken utterance (Harwath et al, 2016;. This result stands also for languages other than English, such as Hindi or Japanese (Harwath et al, 2018; Azuh et al, 2019; Ohishi et al, 2020). and , however, observed that not all layers encode wordlike units, suggesting that some layers specialise in lexical processing whereas some other do not encode such information.…”
Section: Introduction and Prior Work (mentioning)
confidence: 99%