Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d18-1285

Grounding language acquisition by training semantic parsers using captioned videos

Abstract: We develop a semantic parser that is trained in a grounded setting using pairs of videos captioned with sentences. This setting is both data-efficient, requiring little annotation, and similar to the experience of children where they observe their environment and listen to speakers. The semantic parser recovers the meaning of English sentences despite not having access to any annotated sentences. It does so despite the ambiguity inherent in vision where a sentence may refer to any combination of objects, objec…
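The abstract only sketches the training setup at a high level. As a rough illustration of what weak supervision from caption/video pairs can look like, the toy Python sketch below enumerates candidate meanings for a caption from a small hand-written lexicon and nudges a scoring model toward parses that are consistent with what was detected in the paired video. The lexicon, feature function, and perceptron-style update are hypothetical simplifications, not the paper's CCG-based system.

```python
# Toy weakly supervised training loop (hypothetical sketch, not the
# paper's CCG-based parser): supervision comes only from caption/video pairs.
from collections import defaultdict

# Hypothetical lexicon mapping caption words to candidate predicates.
LEXICON = {
    "person": {"person(x)"},
    "ball": {"ball(y)"},
    "picks": {"pick_up(x,y)", "touch(x,y)"},  # ambiguous verb
}

def candidate_parses(caption):
    """Enumerate toy meaning representations (sets of predicates)."""
    parses = [frozenset()]
    for word in caption.lower().split():
        preds = LEXICON.get(word, set())
        if preds:
            parses = [p | {q} for p in parses for q in preds]
    return parses

def features(caption, parse):
    """Sparse word/predicate co-occurrence features."""
    return {(w, pred): 1.0
            for w in caption.lower().split() for pred in parse}

def train(data, epochs=10, lr=0.1):
    """Perceptron-style updates toward the best video-consistent parse."""
    weights = defaultdict(float)

    def score(caption, parse):
        return sum(weights[f] * v for f, v in features(caption, parse).items())

    for _ in range(epochs):
        for caption, detections in data:
            cands = candidate_parses(caption)
            best = max(cands, key=lambda p: score(caption, p))
            consistent = [p for p in cands if p <= detections]
            if not consistent:
                continue  # the video gives no usable signal for this caption
            target = max(consistent, key=lambda p: score(caption, p))
            if best == target:
                continue  # current best guess already matches the video
            for f, v in features(caption, target).items():
                weights[f] += lr * v
            for f, v in features(caption, best).items():
                weights[f] -= lr * v
    return weights

if __name__ == "__main__":
    # One caption paired with predicates "detected" in its video.
    data = [("the person picks up the ball",
             frozenset({"person(x)", "ball(y)", "pick_up(x,y)"}))]
    w = train(data)
    caption = data[0][0]
    print(max(candidate_parses(caption),
              key=lambda p: sum(w[f] for f in features(caption, p))))
```

The key point the sketch tries to convey is that the parser is never shown a gold logical form; the video only rules parses in or out, which is the kind of ambiguous signal the abstract refers to.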

Cited by 13 publications (13 citation statements)
References 16 publications
“…• Non-textual Modality: Multitasking with images is used to perform spoken image captioning (Chrupala, 2019) and grammar induction (Zhao and Titov, 2020). Joint modeling was used in multiresolution language grounding (Koncel-Kedziorski et al., 2014), identifying referring expressions (Roy et al., 2019), multimodal MT (Zhou et al., 2018c), video parsing (Ross et al., 2018), learning latent semantic annotations (Qin et al., 2018), etc.…”
Section: Learning Objective (mentioning)
confidence: 99%
“…• Non-textual Modality: For images, new datasets are curated for a variety of tasks including caption relevance (Suhr et al., 2019), multimodal MT (Zhou et al., 2018c), soccer commentaries (Koncel-Kedziorski et al., 2014), semantic role labeling (Silberer and Pinkal, 2018), instruction following (Han and Schlangen, 2017), navigation (Andreas and Klein, 2014), understanding physical causality of actions, understanding topological spatial expressions (Kelleher et al., 2006), spoken image captioning, entailment (Vu et al., 2018), image search (Kiros et al., 2018), scene generation (Chang et al., 2015), etc. Coming to videos, datasets have become popular for several tasks like identifying action segments (Regneri et al., 2013), semantic parsing (Ross et al., 2018), instruction following from visual demonstration, spatio-temporal question answering (Lei et al., 2020), etc.…”
Section: New Datasets (mentioning)
confidence: 99%
“…The same problem of assigning meaning to symbols has been a fruitful research direction in computer vision. Early work explored weak supervision and the correspondence problem between text annotations and image regions [7,14], with more modern approaches exploring joint image-text word embeddings [17], or building a language conditioned attention map over the images in caption generation, visual question answering and text-based retrieval [3,12,24,32,38,39,45,48,50]. Of particular interest, recent work has focused on multimodal and multilingual settings such as producing captions in many languages, visual-guided translations [8,16,44,46], or bilingual visual question answering [18].…”
Section: Prior Work (mentioning)
confidence: 99%
“…For example, Ross et al. (2018) develop a CCG-based semantic parser for action annotations in videos, representing sentences in an approximate way: neglecting determiners and treating all entity references as variables.…”
(mentioning)
confidence: 99%
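To make the representation described in that last statement concrete, here is a hypothetical example of the kind of approximate logical form such a parser might produce, with determiners dropped and entity references left as unconstrained variables (the sentence and predicate names are illustrative, not taken from the paper):

"The person picks up the ball"  ≈  person(x) ∧ ball(y) ∧ pick_up(x, y)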