Grounding language acquisition by training semantic parsers using captioned videos

Ross, Candace; Barbu, Andrei; Berzak, Yevgeni; Myanganbayar, Battushig; Katz, Boris

doi:10.18653/v1/d18-1285

Cited by 13 publications

(13 citation statements)

References 16 publications


“…• Non-textual Modality: Multitasking with images is used to perform spoken image captioning (Chrupala, 2019) and grammar induction (Zhao and Titov, 2020). Joint modeling was used in multiresolution language grounding Koncel-Kedziorski et al (2014), identifying referring expressions Roy et al (2019), multimodal MT (Zhou et al, 2018c), video parsing Ross et al (2018), learning latent semantic annotations (Qin et al, 2018) etc.,…”

Section: Learning Objective

mentioning

confidence: 99%

“…• Non-textual Modality: For images, new datasets are curated for a variety of tasks including caption relevance (Suhr et al, 2019), multimodal MT (Zhou et al, 2018c), soccer commentaries (Koncel-Kedziorski et al, 2014), semantic role labeling (Silberer and Pinkal, 2018), instruction following (Han and Schlangen, 2017), navigation (Andreas and Klein, 2014), understanding physical causality of actions, understanding topological spatial expressions (Kelleher et al, 2006), spoken image captioning, entailment (Vu et al, 2018), image search (Kiros et al, 2018), scene generation (Chang et al, 2015), etc. Coming to videos, datasets have become popular for several tasks like identifying action segments (Regneri et al, 2013), semantic parsing (Ross et al, 2018), instruction following from visual demonstration, spatio-temporal question answering (Lei et al, 2020), etc.,…”

Section: New Datasets

mentioning

confidence: 99%

“…Kelleher et al (2006) use combinatory categorial grammar (CCG) to build a psycholinguistic based model to predict absolute proximity ratings to identify spatial proximity between objects in a natural scene. Ross et al (2018) employ CCG-based parsing to a fixed set of unary and binary derivation rules to generate semantic parses for videos.…”

Section: Stratification

mentioning

confidence: 99%


Grounding ‘Grounding’ in NLP

Chandu, Bisk, Black (2021)

Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021


The NLP community has seen substantial recent interest in grounding to facilitate interaction between language technologies and the world. However, as a community, we use the term broadly to reference any linking of text to data or non-textual modality. In contrast, Cognitive Science more formally defines "grounding" as the process of establishing what mutual information is required for successful communication between two interlocutors, a definition which might implicitly capture the NLP usage but differs in intent and scope. We investigate the gap between these definitions and seek answers to the following questions: (1) What aspects of grounding are missing from NLP tasks? Here we present the dimensions of coordination, purviews and constraints. (2) How is the term "grounding" used in the current research? We study the trends in datasets, domains, and tasks introduced in recent NLP conferences. And finally, (3) How to advance our current definition to bridge the gap with Cognitive Science? We present ways to both create new tasks or repurpose existing ones to make advancements towards achieving a more complete sense of grounding.


“…The same problem of assigning meaning to symbols has been a fruitful research direction in computer vision. Early work explored weak supervision and the correspondence problem between text annotations and image regions [7,14], with more modern approaches exploring joint image-text word embeddings [17], or building a language conditioned attention map over the images in caption generation, visual question answering and text-based retrieval [3,12,24,32,38,39,45,48,50]. Of particular interest, recent work has focused on multimodal and multilingual settings such as producing captions in many languages, visual-guided translations [8,16,44,46], or bilingual visual question answering [18].…”

Section: Prior Work

mentioning

confidence: 99%

Visual Grounding in Video for Unsupervised Word Translation

Sigurdsson, Alayrac, Nematzadeh et al. (2020)

Preprint


There are thousands of actively spoken languages on Earth, but a single visual world. Grounding in this visual world has the potential to bridge the gap between all these languages. Our goal is to use visual grounding to improve unsupervised word mapping between languages. The key idea is to establish a common visual representation between two languages by learning embeddings from unpaired instructional videos narrated in the native language. Given this shared embedding we demonstrate that (i) we can map words between the languages, particularly the 'visual' words; (ii) that the shared embedding provides a good initialization for existing unsupervised text-based word translation techniques, forming the basis for our proposed hybrid visual-text mapping algorithm, MUVE; and (iii) our approach achieves superior performance by addressing the shortcomings of text-based methods: it is more robust, handles datasets with less commonality, and is applicable to low-resource languages. We apply these methods to translate words from English to French, Korean, and Japanese, all without any parallel corpora and simply by watching many videos of people speaking while doing things.


“…For example, Ross et al (2018) develop a CCG-based semantic parser for action annotations in videos, representing sentences in an approximate way, neglecting determiners and treating all entity references as variables.…”

mentioning

confidence: 99%

A Type-coherent, Expressive Representation as an Initial Step to Language Understanding

Kim, Schubert (2019)

Proceedings of the 13th International Conference on Computational Semantics - Long Papers


A growing interest in tasks involving language understanding by the NLP community has led to the need for effective semantic parsing and inference. Modern NLP systems use semantic representations that do not quite fulfill the nuanced needs for language understanding: adequately modeling language semantics, enabling general inferences, and being accurately recoverable. This document describes underspecified logical forms (ULF) for Episodic Logic (EL), which is an initial form for a semantic representation that balances these needs. ULFs fully resolve the semantic type structure while leaving issues such as quantifier scope, word sense, and anaphora unresolved; they provide a starting point for further resolution into EL, and enable certain structural inferences without further resolution. This document also presents preliminary results of creating a hand-annotated corpus of ULFs for the purpose of training a precise ULF parser, showing a three-person pairwise interannotator agreement of 0.88 on confident annotations. We hypothesize that a divide-and-conquer approach to semantic parsing starting with derivation of ULFs will lead to semantic analyses that do justice to subtle aspects of linguistic meaning, and will enable construction of more accurate semantic parsers.

Introduction

Episodic Logic (EL) is a semantic representation extending FOL, designed to closely match the expressivity and surface form of natural language and to enable deductive inference, uncertain inference, and NLog-like inference (Morbini and Schubert, 2009; Schubert and Hwang, 2000; Schubert, 2014). Kim and Schubert (2016) developed a system that transforms annotated WordNet glosses into EL axioms which were competitive with state-of-the-art lexical inference systems while achieving greater expressivity.

While EL is representationally appropriate for language understanding, the current EL parser is too unreliable for general text: The phrase structures produced by the underlying Treebank parser leave many ambiguities in the semantic type structure, which are disambiguated incorrectly by the hand-coded compositional rules; moreover, errors in the phrase structures can further disrupt the resulting logical forms (LFs). Kim and Schubert (2016) discuss the limitations of the existing parser as a starting point for logically interpreting glosses of WordNet verb entries. In order to build a better EL parser, it seems natural to take advantage of recent advances in corpus-based parsing techniques. This document describes a type-coherent initial LF, or unscoped logical forms (ULF), for EL which captures the predicate-argument structure in the EL semantic types and is the first critical step in fully resolved semantic interpretation of sentences. Montague's profoundly influential work (Montague, 1973) demonstrates that systematic assignments of appropriate semantic types to words and phrases allows us to view language as akin to formal logic, with meanings determined compositionally from syntactic structures. This view of language directly supports inferenc...
