Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2018
DOI: 10.18653/v1/p18-1085

Illustrative Language Understanding: Large-Scale Visual Grounding with Image Search

Abstract: We introduce Picturebook, a large-scale lookup operation to ground language via 'snapshots' of our physical world accessed through image search. For each word in a vocabulary, we extract the top-k images from Google image search and feed the images through a convolutional network to extract a word embedding. We introduce a multimodal gating function to fuse our Picturebook embeddings with other word representations. We also introduce Inverse Picturebook, a mechanism to map a Picturebook embedding back into wor…
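
As a rough illustration of the lookup described in the abstract, the sketch below builds a grounded word embedding by encoding the top-k retrieved images and combining their features. The `search_top_k_images` and `image_encoder` callables are hypothetical placeholders, and the concatenation step is an assumption; the paper's own retrieval pipeline and convolutional network are not reproduced here.

```python
import numpy as np

def picturebook_embedding(word, search_top_k_images, image_encoder, k=10):
    """Sketch of a Picturebook-style lookup: fetch the top-k images for a
    word, encode each one with a convolutional network, and combine the
    per-image features into a single grounded word embedding."""
    images = search_top_k_images(word, k=k)            # hypothetical image-search helper
    features = [image_encoder(img) for img in images]  # one fixed-size feature vector per image
    return np.concatenate(features, axis=-1)           # k * feature_dim grounded embedding
```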

Cited by 43 publications (43 citation statements); References 57 publications.
“…Yang et al. (2017) introduced a gating method for choosing between word and character embeddings, i.e., a method for word embedding selection. Gating has also been widely applied to multimodal fusion (Arevalo et al., 2017; Wang et al., 2018b; Kiros et al., 2018). Our work is also related to recent methods that induce contextualized word representations (McCann et al., 2017; Peters et al., 2018) as well as pre-training language models for task-dependent fine-tuning (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018).…”
Section: Related Work
confidence: 83%
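
The citation above points to the multimodal gating that Kiros et al. (2018) use to fuse Picturebook embeddings with textual word vectors. The module below is a minimal, generic gated-fusion sketch in PyTorch: a sigmoid gate computed from both inputs takes an element-wise convex combination of their projections. The exact parameterization in the paper may differ; this is an illustrative variant, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Generic gated fusion of a textual embedding and a visually grounded
    (Picturebook-style) embedding."""
    def __init__(self, text_dim, image_dim, out_dim):
        super().__init__()
        self.proj_text = nn.Linear(text_dim, out_dim)
        self.proj_image = nn.Linear(image_dim, out_dim)
        self.gate = nn.Linear(text_dim + image_dim, out_dim)

    def forward(self, text_emb, image_emb):
        # Gate decides, per dimension, how much to trust each modality.
        g = torch.sigmoid(self.gate(torch.cat([text_emb, image_emb], dim=-1)))
        return g * self.proj_text(text_emb) + (1 - g) * self.proj_image(image_emb)
```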
“…We also experimented with additional embedding types, including Picturebook (Kiros et al., 2018), knowledge-graph-based, and neural-machine-translation-based embeddings. While adding these embeddings improved performance on NLI, they did not lead to any performance gains on downstream tasks.…”
Section: Limitations
confidence: 99%
“…We also adapted our approach to a visual dialogue task and achieved excellent performance. A possible improvement to our work is adding pre-trained embeddings such as BERT (Devlin et al., 2018) or image-grounded word embeddings (Kiros et al., 2018) to improve the semantic understanding capability of the models. (The dialogue history length was set to 10 due to memory issues with large input sequences.)…”
Section: Results
confidence: 99%
“…In EXP 2.1, we compare our model with the Inception V3 network (Ioffe and Szegedy, 2015) for the visual stimuli, and in EXP 2.2 with SoundNet (Aytar et al., 2016) for the auditory stimuli. These two models present competitive results on different audio-visual recognition tasks (Jansen et al., 2018; Jiang et al., 2018; Kiros et al., 2018; Kumar et al., 2018). For all experiments, we trained the models 10 times and determined the mean accuracy and standard deviation for each modality.…”
Section: Methods
confidence: 99%
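
The quoted protocol reports the mean accuracy and standard deviation over 10 training runs per modality. A trivial sketch of that aggregation follows; the accuracies are randomly generated placeholders, not the cited paper's results.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder accuracies standing in for 10 training runs per modality;
# the values are random and only illustrate the mean/std reporting.
runs = {
    "visual": rng.uniform(0.6, 0.8, size=10),
    "auditory": rng.uniform(0.5, 0.7, size=10),
}
for modality, accs in runs.items():
    print(f"{modality}: mean accuracy = {accs.mean():.3f}, std = {accs.std():.3f}")
```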