2015
DOI: 10.1609/aaai.v29i1.9512
Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework

Abstract: Recently, joint video-language modeling has been attracting more and more attention. However, most existing approaches focus on building a language model on top of a fixed visual model. In this paper, we propose a unified framework that jointly models video and the corresponding text sentences. The framework consists of three parts: a compositional semantics language model, a deep video model, and a joint embedding model. In our language model, we propose a dependency-tree structure model that embeds sentence …
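As a rough illustration of the joint embedding idea named in the abstract (the third part of the framework), the minimal PyTorch sketch below projects video and sentence features into a shared space and trains with a bidirectional margin ranking loss. The dimensions, module names, and loss are assumptions for illustration only, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Minimal sketch: project video and sentence features into a shared
    space. Feature dimensions are placeholders, not the paper's values."""
    def __init__(self, video_dim=4096, text_dim=300, embed_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)  # stands in for the deep video model's output head
        self.text_proj = nn.Linear(text_dim, embed_dim)    # stands in for the compositional language model's output

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

def ranking_loss(v, t, margin=0.2):
    """Bidirectional hinge ranking loss over in-batch negatives (a common
    choice for joint embeddings, not necessarily the paper's objective)."""
    scores = v @ t.T                 # cosine similarities: rows = videos, cols = sentences
    pos = scores.diag().unsqueeze(1) # matched pairs sit on the diagonal
    cost_t = (margin + scores - pos).clamp(min=0)    # video -> wrong sentence
    cost_v = (margin + scores - pos.T).clamp(min=0)  # sentence -> wrong video
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_t = cost_t.masked_fill(mask, 0)
    cost_v = cost_v.masked_fill(mask, 0)
    return cost_t.mean() + cost_v.mean()
```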

Cited by 164 publications (32 citation statements)
References 29 publications
“…There are generally two categories of methods for video captioning: template-based models (Kojima, Tamura, and Fukunaga 2002; Rohrbach et al. 2013; Guadarrama et al. 2013; Xu et al. 2015) and sequence learning models (e.g., RNNs) (Donahue et al. 2015; Pan et al. 2016a; Venugopalan et al. 2015a; Yao et al. 2015; Venugopalan et al. 2015b; Pan et al. 2016b). The former predefines special rules for language grammar and then parses the sentence into several parts (e.g., subject, verb, object).…”
Section: Related Work
confidence: 99%
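To make the template-based category concrete, here is a toy Python sketch of the generation side of the SVO idea the excerpt describes: recognized subject/verb/object labels are slotted into a predefined grammar rule. The labels and template are hypothetical; real systems obtain them from visual classifiers and semantic hierarchies.

```python
def svo_caption(subject: str, verb: str, obj: str) -> str:
    """Toy template-based captioner: fill a fixed grammar rule with
    recognized (subject, verb, object) labels. The template is a
    hypothetical example of the 'special rule' such systems predefine."""
    return f"A {subject} is {verb} a {obj}."

# Hypothetical detections from a visual recognition stage.
print(svo_caption("person", "riding", "horse"))  # -> "A person is riding a horse."
```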
“…In (Guadarrama et al. 2013), Guadarrama et al. utilize semantic hierarchies to choose an appropriate level of specificity and accuracy for sentence fragments. Recently, Xu et al. designed a unified framework (Xu et al. 2015) that consists of a compositional semantics language model, a deep video model, and an embedding model to capture the joint video-language relationship for video sentence generation.…”
Section: Template-based Model
confidence: 99%
“…Many multi-modal retrieval frameworks that retrieve images using text queries (Jeon, Lavrenko, and Manmatha 2003; Guillaumin et al. 2009; Xu et al. 2015) have been proposed. However, we use the PCCA-based retrieval model, which easily combines motion and video features and can be readily extended to the re-ranking model.…”
Section: Egocentric Video Search Framework
confidence: 99%
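As a rough sketch of CCA-style cross-modal retrieval, the snippet below learns a shared space from paired video/text features and ranks videos for a text query by cosine similarity. It uses scikit-learn's plain CCA as a stand-in for the PCCA model the excerpt mentions, and all shapes and data are synthetic placeholders.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
# Synthetic paired features: 200 video/text pairs (placeholder dimensions).
video_feats = rng.normal(size=(200, 128))
text_feats = rng.normal(size=(200, 64))

# Fit CCA on the paired training features (plain CCA; the cited work uses PCCA).
cca = CCA(n_components=16)
cca.fit(text_feats, video_feats)
text_c, video_c = cca.transform(text_feats, video_feats)

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

text_c, video_c = normalize(text_c), normalize(video_c)

# Retrieve: rank all videos for the first text query by cosine similarity.
query = text_c[0]
ranking = np.argsort(-(video_c @ query))
print("top-5 videos for query 0:", ranking[:5])
```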
“…Recently, multimodal learning between image and language (Ma et al. 2015; Nakamura et al. 2013; Xu et al. 2015b) has become an increasingly popular research area of artificial intelligence (AI). In particular, there has been rapid progress on the tasks of bidirectional image and sentence retrieval (Frome et al. 2013; Socher et al. 2014; Klein et al. 2015; Karpathy, Joulin, and Li 2014; Ma et al. 2015; Ordonez, Kulkarni, and Berg 2011) and automatic image captioning (Chen and Zitnick 2014; Donahue et al. 2014; Fang et al. 2014; Kiros, Salakhutdinov, and Zemel 2014a; Klein et al. 2015; Mao et al. 2014a; Vinyals et al. 2014; Xu et al. 2015a).…”
Section: Introduction
confidence: 99%