2015
DOI: 10.1609/aaai.v29i1.9512
Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework

Abstract: Recently, joint video-language modeling has been attracting more and more attention. However, most existing approaches focus on building a language model on top of a fixed visual model. In this paper, we propose a unified framework that jointly models video and the corresponding text sentences. The framework consists of three parts: a compositional semantics language model, a deep video model, and a joint embedding model. In our language model, we propose a dependency-tree structure model that embeds sentence …
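As a rough illustration of the joint embedding idea named in the abstract (the third part of the framework), the minimal PyTorch sketch below projects video and sentence features into a shared space and trains with a bidirectional margin ranking loss. The dimensions, module names, and loss are assumptions for illustration only, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Minimal sketch: project video and sentence features into a shared
    space. Feature dimensions are placeholders, not the paper's values."""
    def __init__(self, video_dim=4096, text_dim=300, embed_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)  # stands in for the deep video model's output head
        self.text_proj = nn.Linear(text_dim, embed_dim)    # stands in for the compositional language model's output

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

def ranking_loss(v, t, margin=0.2):
    """Bidirectional hinge ranking loss over in-batch negatives (a common
    choice for joint embeddings, not necessarily the paper's objective)."""
    scores = v @ t.T                 # cosine similarities: rows = videos, cols = sentences
    pos = scores.diag().unsqueeze(1) # matched pairs sit on the diagonal
    cost_t = (margin + scores - pos).clamp(min=0)    # video -> wrong sentence
    cost_v = (margin + scores - pos.T).clamp(min=0)  # sentence -> wrong video
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_t = cost_t.masked_fill(mask, 0)
    cost_v = cost_v.masked_fill(mask, 0)
    return cost_t.mean() + cost_v.mean()
```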

Cited by 164 publications (32 citation statements)
References 29 publications
“…There are generally two categories of methods for video captioning: template-based models (Kojima, Tamura, and Fukunaga 2002; Rohrbach et al. 2013; Guadarrama et al. 2013; Xu et al. 2015) and sequence learning models (e.g., RNNs) (Donahue et al. 2015; Pan et al. 2016a; Venugopalan et al. 2015a; Yao et al. 2015; Venugopalan et al. 2015b; Pan et al. 2016b). The former predefines special rules for language grammar and then parses the sentence into several parts (e.g., subject, verb, object).…”
Section: Related Work
confidence: 99%
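To make the template-based category concrete, here is a toy Python sketch of the generation side of the SVO idea the excerpt describes: recognized subject/verb/object labels are slotted into a predefined grammar rule. The labels and template are hypothetical; real systems obtain them from visual classifiers and semantic hierarchies.

```python
def svo_caption(subject: str, verb: str, obj: str) -> str:
    """Toy template-based captioner: fill a fixed grammar rule with
    recognized (subject, verb, object) labels. The template is a
    hypothetical example of the 'special rule' such systems predefine."""
    return f"A {subject} is {verb} a {obj}."

# Hypothetical detections from a visual recognition stage.
print(svo_caption("person", "riding", "horse"))  # -> "A person is riding a horse."
```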
“…In (Guadarrama et al. 2013), Guadarrama et al. utilize semantic hierarchies to choose an appropriate level of specificity and accuracy for sentence fragments. Recently, Xu et al. designed a unified framework (Xu et al. 2015) that consists of a compositional semantics language model, a deep video model, and an embedding model to capture the joint video-language relationship for video sentence generation.…”
Section: Template-based Model
confidence: 99%
“…Many multi-modal retrieval frameworks that retrieve images using text queries (Jeon, Lavrenko, and Manmatha 2003; Guillaumin et al. 2009; Xu et al. 2015) have been proposed. However, we use the PCCA-based retrieval model, which easily combines motion and video features and can be readily extended to the re-ranking model.…”
Section: Egocentric Video Search Framework
confidence: 99%
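As a rough sketch of CCA-style cross-modal retrieval, the snippet below learns a shared space from paired video/text features and ranks videos for a text query by cosine similarity. It uses scikit-learn's plain CCA as a stand-in for the PCCA model the excerpt mentions, and all shapes and data are synthetic placeholders.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
# Synthetic paired features: 200 video/text pairs (placeholder dimensions).
video_feats = rng.normal(size=(200, 128))
text_feats = rng.normal(size=(200, 64))

# Fit CCA on the paired training features (plain CCA; the cited work uses PCCA).
cca = CCA(n_components=16)
cca.fit(text_feats, video_feats)
text_c, video_c = cca.transform(text_feats, video_feats)

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

text_c, video_c = normalize(text_c), normalize(video_c)

# Retrieve: rank all videos for the first text query by cosine similarity.
query = text_c[0]
ranking = np.argsort(-(video_c @ query))
print("top-5 videos for query 0:", ranking[:5])
```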
“…Recently, multimodal learning between image and language (Ma et al. 2015; Nakamura et al. 2013; Xu et al. 2015b) has become an increasingly popular research area of artificial intelligence (AI). In particular, there has been rapid progress on the tasks of bidirectional image and sentence retrieval (Frome et al. 2013; Socher et al. 2014; Klein et al. 2015; Karpathy, Joulin, and Li 2014; Ma et al. 2015; Ordonez, Kulkarni, and Berg 2011) and automatic image captioning (Chen and Zitnick 2014; Donahue et al. 2014; Fang et al. 2014; Kiros, Salakhutdinov, and Zemel 2014a; Klein et al. 2015; Mao et al. 2014a; Vinyals et al. 2014; Xu et al. 2015a).…”
Section: Introduction
confidence: 99%