2015
DOI: 10.1613/jair.4556
A Compositional Framework for Grounding Language Inference, Generation, and Acquisition in Video

Abstract: We present an approach to simultaneously reasoning about a video clip and an entire natural-language sentence. The compositional nature of language is exploited to construct models which represent the meanings of entire sentences composed out of the meanings of the words in those sentences mediated by a grammar that encodes the predicate-argument relations. We demonstrate that these models faithfully represent the meanings of sentences and are sensitive to how the roles played by participants (nouns), their ch…
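As a rough illustration of the compositional idea described in the abstract, the sketch below composes per-word scoring functions over participant tracks according to a predicate-argument structure supplied by a parse. Everything here (function names, the track representation, the scoring scheme) is an illustrative assumption, not the paper's model; the actual framework composes HMM-style word models with object-track lattices and performs joint inference, which this toy omits.

```python
"""Toy sketch: compose word-level scores into a sentence-level score
under a predicate-argument structure. Illustrative assumptions only."""
from typing import Callable, Dict, List, Sequence, Tuple

# A "track" stands in for the time series of detections of one participant.
Track = List[dict]

# A word model scores how well a tuple of participant tracks exhibits the
# word's meaning, e.g. approach(agent, patient) over two tracks.
WordModel = Callable[[Sequence[Track]], float]


def sentence_score(
    word_models: Dict[str, WordModel],
    predicate_args: List[Tuple[str, Tuple[str, ...]]],
    assignment: Dict[str, Track],
) -> float:
    """Sum the scores of every word applied to the tracks that the
    predicate-argument structure binds to its arguments. Because one
    assignment feeds every word, all words must be satisfied by the
    same set of participants."""
    total = 0.0
    for word, args in predicate_args:
        tracks = [assignment[var] for var in args]
        total += word_models[word](tracks)
    return total


if __name__ == "__main__":
    # Dummy word models that score only by track length, purely to show the
    # composition mechanism; real models would score appearance and motion.
    models: Dict[str, WordModel] = {
        "person": lambda ts: float(len(ts[0])),
        "approach": lambda ts: float(len(ts[0]) + len(ts[1])),
        "chair": lambda ts: float(len(ts[0])),
    }
    # "The person approached the chair": person(x0), approach(x0, x1), chair(x1)
    structure = [("person", ("x0",)), ("approach", ("x0", "x1")), ("chair", ("x1",))]
    binding = {"x0": [{"frame": 0}, {"frame": 1}], "x1": [{"frame": 0}]}
    print(sentence_score(models, structure, binding))  # -> 6.0
```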

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
1
1
1
1

Citation Types

0
20
0

Year Published

2016
2016
2021
2021

Publication Types

Select...
5
3

Relationship

4
4

Authors

Journals

citations
Cited by 29 publications
(21 citation statements)
references
References 86 publications
0
20
0
Order By: Relevance
“…This is possibly because small objects, such as utensils and ingredients, are hard to detect using global visual features but are crucial for describing a recipe. Hence, one future extension for our work is to incorporate object detectors/trackers [39,40] into the current captioning system. We show qualitative results in Fig.…”
Section: Comparison With State-of-the-art Methods
confidence: 99%
“…Learning occurs mostly in simulation and with little visual ambiguity, and the resulting model is not a parser but a means of associating n-grams with visual concepts. Siddharth et al. (2014) and Yu et al. (2015) acquire the meaning of a lexicon from videos paired with sentences but assume a fully-trained parser. Matuszek et al. (2012) similarly present a model to learn the meanings and referents of words restricted to attributes and static scenes.…”
Section: Prior Work
confidence: 99%
“…We call this process sentence directed video object codiscovery. It can be viewed as the inverse of video captioning/description (Barbu et al. 2012; Das et al. 2013; Guadarrama et al. 2013; Rohrbach et al. 2014; Venugopalan et al. 2015; Yu et al. 2015, 2016), where object evidence (in the form of detections or other visual features) is first produced by pretrained detectors and then sentences are generated given the object appearance and movement.…”
Section: Fig.
confidence: 99%