2014
DOI: 10.1109/mmul.2014.29
Joint Video and Text Parsing for Understanding Events and Answering Queries

Abstract: We propose a framework for parsing video and text jointly for understanding events and answering user queries. Our framework produces a parse graph that represents the compositional structures of spatial information (objects and scenes), temporal information (actions and events), and causal information (causalities between events and fluents) in the video and text. The knowledge representation of our framework is based on a spatial-temporal-causal And-Or graph (S/T/C-AOG), which jointly models possible hierarch…
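The abstract describes the S/T/C-AOG as a grammar whose And-nodes compose parts and whose Or-nodes select among alternatives. The sketch below is a hypothetical, minimal illustration of that node structure (the class names, labels, and toy event grammar are invented for illustration and are not from the paper):

```python
from dataclasses import dataclass, field
from enum import Enum

class NodeType(Enum):
    AND = "and"            # composition: all children are present
    OR = "or"              # alternative: one child is selected
    TERMINAL = "terminal"  # atomic object, action, or event

@dataclass
class AOGNode:
    label: str
    kind: NodeType
    children: list = field(default_factory=list)

    def leaves(self):
        """Collect all terminal labels reachable from this node."""
        if self.kind is NodeType.TERMINAL:
            return [self.label]
        out = []
        for child in self.children:
            out.extend(child.leaves())
        return out

# A toy temporal fragment: a "get-drink" event decomposes (AND) into
# two sub-events, one of which offers alternative realizations (OR).
fetch = AOGNode("fetch-cup", NodeType.TERMINAL)
tap = AOGNode("use-tap", NodeType.TERMINAL)
dispenser = AOGNode("use-dispenser", NodeType.TERMINAL)
fill = AOGNode("fill-cup", NodeType.OR, [tap, dispenser])
get_drink = AOGNode("get-drink", NodeType.AND, [fetch, fill])

print(get_drink.leaves())  # all terminals the grammar can generate
```

A parse graph, in this reading, would be one concrete derivation of such a grammar: each Or-node resolved to a single child, grounded in the video and text evidence.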

Cited by 90 publications (5 citation statements)
References 53 publications
“…We have shown how the learning of scripts, which is central to many of the processes described, can capitalize on the current state of the art in computer vision: the learning of possible sequences of behavior observed in an environment and the encoding of that knowledge in the form of AND-OR graphs (Gupta et al., 2009; Si et al., 2011; Pei et al., 2011; Tu et al., 2014), which can in turn be encoded as scripts. In addition, rapid effective causal learning lies at the heart of learning scripts rapidly (Ho, 2014; Ho, 2016a; Ho & Liausvia, 2013, 2014).…”
Section: Discussion
confidence: 99%
“…With the advent of computer vision and other sensing technologies, scripts can be learned through visual observation (or through other sensory modalities). Recently, there has been some work on using computer vision to observe a scene filled with (human and other) activities and to construct AND-OR graphs that capture the possible sequences of activities that can take place (Gupta, Srinivasan, Shi & Davis, 2009; Si, Pei, Yao & Zhu, 2011; Pei, Jia & Zhu, 2011; Tu, Meng, Lee, Choe & Zhu, 2014).…”
Section: Rapid Learning of Problem-Solving Scripts
confidence: 99%
“…That is, its evaluation shouldn't be as hard as the task itself, and it must not be solvable using shortcuts or cheats. To solve these two problems, we propose the task of visual question answering (VQA) (Antol et al., 2015; Geman et al., 2015; Malinowski and Fritz, 2014; Tu et al., 2014; Bigham et al., 2010; Gao et al., 2015). The task of VQA requires a machine to answer a natural language question about an image, as shown in figure 2.…”
Section: Visual Question Answering
confidence: 99%