ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8683587

Multimodal One-shot Learning of Speech and Images

Abstract: Imagine a robot is shown new concepts visually together with spoken tags, e.g. "milk", "eggs", "butter". After seeing one paired audiovisual example per class, it is shown a new set of unseen instances of these objects, and asked to pick the "milk". Without receiving any hard labels, could it learn to match the new continuous speech input to the correct visual instance? Although unimodal one-shot learning has been studied, where one labelled example in a single modality is given per class, this example motivat…
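The matching task in the abstract can be framed as two unimodal comparisons bridged by the one-shot support set: the spoken query is first compared to the support-set speech examples, and the winning pair's image is then compared to the unseen test images. Below is a minimal sketch of that framing in Python; the placeholder encoders embed_speech and embed_image and the cosine scoring are assumptions for illustration, not the paper's actual pipeline.

import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def one_shot_cross_modal_match(query_speech, support, test_images,
                               embed_speech, embed_image):
    # support: list of (speech_example, image_example, class_label),
    # one paired example per class. Returns the index of the test image
    # that best matches the spoken query.
    q = embed_speech(query_speech)
    # Step 1 (speech-to-speech): find the support pair whose spoken tag
    # sounds most like the query ("milk", "eggs", ...).
    best_pair = max(support, key=lambda s: cosine(q, embed_speech(s[0])))
    # Step 2 (image-to-image): find the unseen test image closest to that
    # support pair's image.
    ref = embed_image(best_pair[1])
    sims = [cosine(ref, embed_image(img)) for img in test_images]
    return int(np.argmax(sims))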

Cited by 26 publications (38 citation statements)
References 27 publications (50 reference statements)
“…A study was conducted to investigate recent developments in Siamese convolutional neural networks [Eloff et al 2019]. The study in [Eloff et al 2019] used a dataset consisting of spoken and visual digits, and high accuracy was achieved by using pixel distance on the images in the developed Siamese model.…”
Section: Literature Review
Mentioning (confidence: 99%)
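As a rough illustration of the Siamese-style comparison this statement refers to, the sketch below pairs a small shared CNN encoder with an element-wise distance score in PyTorch. The layer sizes, 28x28 grayscale inputs and Euclidean distance are assumptions for the example, not the exact model of [Eloff et al 2019].

import torch
import torch.nn as nn

class SiameseCNN(nn.Module):
    def __init__(self, emb_dim=64):
        super().__init__()
        # One encoder, shared by both branches of the Siamese pair.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.Linear(64 * 7 * 7, emb_dim))

    def forward(self, a, b):
        # Embed both inputs with the same weights and score them by
        # the distance between embeddings (smaller = more similar).
        za, zb = self.encoder(a), self.encoder(b)
        return torch.norm(za - zb, dim=1)

model = SiameseCNN()
x1 = torch.randn(4, 1, 28, 28)  # e.g. a batch of 28x28 grayscale digit images
x2 = torch.randn(4, 1, 28, 28)
print(model(x1, x2).shape)      # torch.Size([4])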
“…Interest in this area has recently surged. Various learning objectives have been proposed, including autoencoding with structured latent spaces (van den Oord et al., 2017; Eloff et al., 2019; Chorowski et al., 2019; Hsu et al., 2017b; Hsu and Glass, 2018b; Khurana et al., 2019), predictive coding (Chung et al., 2019; Wang et al., 2020a), contrastive learning (Oord et al., 2018; Schneider et al., 2019), and more. Prior work addresses inferring linguistic content such as phones from the learned representations (Baevski et al., 2020; Kharitonov et al., 2020; Hsu et al., 2021).…”
Section: Related Work
Mentioning (confidence: 99%)
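Of the objective families listed in this statement, contrastive learning is the simplest to illustrate compactly. The sketch below is a generic InfoNCE-style loss of the kind used in that line of work (e.g. CPC / wav2vec); the batch construction, embedding size and temperature are illustrative assumptions, not values from any cited paper.

import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.1):
    # anchors, positives: (batch, dim) embeddings where row i of
    # `positives` is the true match for row i of `anchors`; every
    # other row in the batch serves as a negative.
    a = F.normalize(anchors, dim=1)
    p = F.normalize(positives, dim=1)
    logits = a @ p.t() / temperature       # (batch, batch) similarities
    targets = torch.arange(a.size(0))      # diagonal entries are positives
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))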
“…A rich body of work has recently emerged investigating representation learning for speech using visual grounding objectives (Synnaeve et al., 2014; Harwath and Glass, 2015; Harwath et al., 2016; Kamper et al., 2017; Havard et al., 2019a; Merkx et al., 2019; Scharenborg et al., 2018; Hsu and Glass, 2018a; Kamper et al., 2018; Ilharco et al., 2019; Eloff et al., 2019), as well as how word-like and subword-like linguistic units can be made to emerge within these models (Harwath and Glass, 2017; Drexler and Glass, 2017; Havard et al., 2019b; Harwath et al., 2020). So far, these efforts have predominantly focused on inference, where the goal is to learn a mapping from speech waveforms to a semantic embedding space.…”
Section: Introduction
Mentioning (confidence: 99%)
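The inference setting described in this statement, mapping speech into a semantic embedding space shared with images, is commonly trained with a bidirectional max-margin ranking loss over matched speech-image pairs. The sketch below is a generic version of that objective; the margin value and the row-aligned batch convention are assumptions, not taken from any specific cited model.

import torch
import torch.nn.functional as F

def ranking_loss(speech_emb, image_emb, margin=0.2):
    # speech_emb, image_emb: (batch, dim) embeddings where row i of each
    # tensor comes from the same matched speech-image pair.
    s = F.normalize(speech_emb, dim=1)
    v = F.normalize(image_emb, dim=1)
    sims = s @ v.t()                        # (batch, batch) similarity matrix
    pos = sims.diag().unsqueeze(1)          # similarities of matched pairs
    cost_s = (margin + sims - pos).clamp(min=0)      # speech -> image direction
    cost_v = (margin + sims - pos.t()).clamp(min=0)  # image -> speech direction
    mask = torch.eye(sims.size(0), dtype=torch.bool)
    return (cost_s.masked_fill(mask, 0).mean()
            + cost_v.masked_fill(mask, 0).mean())

loss = ranking_loss(torch.randn(8, 512), torch.randn(8, 512))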
“…The first work in this direction relied on phone strings to represent the speech (Roy & Pentland, 2002; Roy, 2003), but more recently this learning has been shown to be possible directly on the speech signal (Synnaeve et al., 2014; Harwath & Glass, 2015; Harwath et al., 2016). Subsequent work on visually-grounded models of speech has investigated improvements and alternatives to the modeling or training algorithms (Leidal et al., 2017; Kamper et al., 2017c; Havard et al., 2019a; Merkx et al., 2019; Scharenborg et al., 2018a; Ilharco et al., 2019; Eloff et al., 2019a), application to multilingual settings (Harwath et al., 2018a; Kamper & Roth, 2017; Azuh et al., 2019; Havard et al., 2019a), analysis of the linguistic abstractions, such as words and phones, which are learned by the models (Harwath et al., 2018b; Drexler & Glass, 2017; Havard et al., 2019b), and the impact of jointly training with textual input (Holzenberger et al., 2019; Chrupała, 2019; Pasad et al., 2019). Representations learned by models of visually grounded speech are also well-suited for transfer learning to supervised tasks, being highly robust to noise and domain shift.…”
Section: Related Work
Mentioning (confidence: 99%)