ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8683587

Multimodal One-shot Learning of Speech and Images

Abstract: Imagine a robot is shown new concepts visually together with spoken tags, e.g. "milk", "eggs", "butter". After seeing one paired audiovisual example per class, it is shown a new set of unseen instances of these objects, and asked to pick the "milk". Without receiving any hard labels, could it learn to match the new continuous speech input to the correct visual instance? Although unimodal one-shot learning has been studied, where one labelled example in a single modality is given per class, this example motivat…
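The matching task in the abstract can be framed as two unimodal comparisons bridged by the one-shot support set: the spoken query is first compared to the support-set speech examples, and the winning pair's image is then compared to the unseen test images. Below is a minimal sketch of that framing in Python; the placeholder encoders embed_speech and embed_image and the cosine scoring are assumptions for illustration, not the paper's actual pipeline.

import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def one_shot_cross_modal_match(query_speech, support, test_images,
                               embed_speech, embed_image):
    # support: list of (speech_example, image_example, class_label),
    # one paired example per class. Returns the index of the test image
    # that best matches the spoken query.
    q = embed_speech(query_speech)
    # Step 1 (speech-to-speech): find the support pair whose spoken tag
    # sounds most like the query ("milk", "eggs", ...).
    best_pair = max(support, key=lambda s: cosine(q, embed_speech(s[0])))
    # Step 2 (image-to-image): find the unseen test image closest to that
    # support pair's image.
    ref = embed_image(best_pair[1])
    sims = [cosine(ref, embed_image(img)) for img in test_images]
    return int(np.argmax(sims))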

Cited by 26 publications (38 citation statements)
References 27 publications (50 reference statements)
“…A study was conducted to investigate recent developments in Siamese convolutional neural networks [Eloff et al 2019]. The study in [Eloff et al 2019] used a dataset consisting of spoken and visual digits, and high accuracy was achieved by using pixel distance on the images in the developed Siamese model.…”
Section: Literature Review
Mentioning (confidence: 99%)
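As a rough illustration of the Siamese-style comparison this statement refers to, the sketch below pairs a small shared CNN encoder with an element-wise distance score in PyTorch. The layer sizes, 28x28 grayscale inputs and Euclidean distance are assumptions for the example, not the exact model of [Eloff et al 2019].

import torch
import torch.nn as nn

class SiameseCNN(nn.Module):
    def __init__(self, emb_dim=64):
        super().__init__()
        # One encoder, shared by both branches of the Siamese pair.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.Linear(64 * 7 * 7, emb_dim))

    def forward(self, a, b):
        # Embed both inputs with the same weights and score them by
        # the distance between embeddings (smaller = more similar).
        za, zb = self.encoder(a), self.encoder(b)
        return torch.norm(za - zb, dim=1)

model = SiameseCNN()
x1 = torch.randn(4, 1, 28, 28)  # e.g. a batch of 28x28 grayscale digit images
x2 = torch.randn(4, 1, 28, 28)
print(model(x1, x2).shape)      # torch.Size([4])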
“…Interest in this area has recently surged. Various learning objectives have been proposed, including autoencoding with structured latent spaces (van den Oord et al., 2017; Eloff et al., 2019; Chorowski et al., 2019; Hsu et al., 2017b; Hsu and Glass, 2018b; Khurana et al., 2019), predictive coding (Chung et al., 2019; Wang et al., 2020a), contrastive learning (Oord et al., 2018; Schneider et al., 2019), and more. Prior work addresses inferring linguistic content such as phones from the learned representations (Baevski et al., 2020; Kharitonov et al., 2020; Hsu et al., 2021).…”
Section: Related Work
Mentioning (confidence: 99%)
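Of the objective families listed in this statement, contrastive learning is the simplest to illustrate compactly. The sketch below is a generic InfoNCE-style loss of the kind used in that line of work (e.g. CPC / wav2vec); the batch construction, embedding size and temperature are illustrative assumptions, not values from any cited paper.

import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.1):
    # anchors, positives: (batch, dim) embeddings where row i of
    # `positives` is the true match for row i of `anchors`; every
    # other row in the batch serves as a negative.
    a = F.normalize(anchors, dim=1)
    p = F.normalize(positives, dim=1)
    logits = a @ p.t() / temperature       # (batch, batch) similarities
    targets = torch.arange(a.size(0))      # diagonal entries are positives
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))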
“…A rich body of work has recently emerged investigating representation learning for speech using visual grounding objectives (Synnaeve et al., 2014; Harwath and Glass, 2015; Harwath et al., 2016; Kamper et al., 2017; Havard et al., 2019a; Merkx et al., 2019; Scharenborg et al., 2018; Hsu and Glass, 2018a; Kamper et al., 2018; Ilharco et al., 2019; Eloff et al., 2019), as well as how word-like and subword-like linguistic units can be made to emerge within these models (Harwath and Glass, 2017; Drexler and Glass, 2017; Havard et al., 2019b; Harwath et al., 2020). So far, these efforts have predominantly focused on inference, where the goal is to learn a mapping from speech waveforms to a semantic embedding space.…”
Section: Introduction
Mentioning (confidence: 99%)
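The inference setting described in this statement, mapping speech into a semantic embedding space shared with images, is commonly trained with a bidirectional max-margin ranking loss over matched speech-image pairs. The sketch below is a generic version of that objective; the margin value and the row-aligned batch convention are assumptions, not taken from any specific cited model.

import torch
import torch.nn.functional as F

def ranking_loss(speech_emb, image_emb, margin=0.2):
    # speech_emb, image_emb: (batch, dim) embeddings where row i of each
    # tensor comes from the same matched speech-image pair.
    s = F.normalize(speech_emb, dim=1)
    v = F.normalize(image_emb, dim=1)
    sims = s @ v.t()                        # (batch, batch) similarity matrix
    pos = sims.diag().unsqueeze(1)          # similarities of matched pairs
    cost_s = (margin + sims - pos).clamp(min=0)      # speech -> image direction
    cost_v = (margin + sims - pos.t()).clamp(min=0)  # image -> speech direction
    mask = torch.eye(sims.size(0), dtype=torch.bool)
    return (cost_s.masked_fill(mask, 0).mean()
            + cost_v.masked_fill(mask, 0).mean())

loss = ranking_loss(torch.randn(8, 512), torch.randn(8, 512))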
“…The first work in this direction relied on phone strings to represent the speech (Roy & Pentland, 2002; Roy, 2003), but more recently this learning has been shown to be possible directly on the speech signal (Synnaeve et al., 2014; Harwath & Glass, 2015; Harwath et al., 2016). Subsequent work on visually-grounded models of speech has investigated improvements and alternatives to the modeling or training algorithms (Leidal et al., 2017; Kamper et al., 2017c; Havard et al., 2019a; Merkx et al., 2019; Scharenborg et al., 2018a; Ilharco et al., 2019; Eloff et al., 2019a), application to multilingual settings (Harwath et al., 2018a; Kamper & Roth, 2017; Azuh et al., 2019; Havard et al., 2019a), analysis of the linguistic abstractions, such as words and phones, which are learned by the models (Harwath et al., 2018b; Drexler & Glass, 2017; Havard et al., 2019b), and the impact of jointly training with textual input (Holzenberger et al., 2019; Chrupała, 2019; Pasad et al., 2019). Representations learned by models of visually grounded speech are also well-suited for transfer learning to supervised tasks, being highly robust to noise and domain shift.…”
Section: Related Work
Mentioning (confidence: 99%)