2014
DOI: 10.21236/ada623249

Long-term Recurrent Convolutional Networks for Visual Recognition and Description

Abstract: Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent are effective for tasks involving sequences, visual and otherwise. We describe a class of recurrent convolutional architectures which is end-to-end trainable and suitable for large-scale visual understanding tasks, and demonstrate the value of these models for activity recognition, image captioning, and video description. In contrast to previous models wh…
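The abstract describes an architecture that pairs a per-frame convolutional encoder with a recurrent model over time. As a rough illustration only, here is a minimal sketch of that idea in PyTorch, assuming a recent torchvision is available; the ResNet-18 backbone, hidden size, and class count are placeholder choices for this sketch, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class LRCNSketch(nn.Module):
    """Illustrative CNN-per-frame + LSTM-over-time model (activity-recognition variant)."""

    def __init__(self, num_classes: int, hidden_size: int = 256):
        super().__init__()
        # Per-frame visual encoder: a small ResNet with its classifier removed.
        backbone = models.resnet18(weights=None)
        self.feature_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()
        self.cnn = backbone
        # Recurrent model over the sequence of per-frame features.
        self.lstm = nn.LSTM(self.feature_dim, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, time, channels, height, width)
        b, t, c, h, w = clips.shape
        feats = self.cnn(clips.reshape(b * t, c, h, w)).reshape(b, t, -1)
        outputs, _ = self.lstm(feats)
        # Score activities from the final time step.
        return self.classifier(outputs[:, -1])

# Example: 2 clips of 8 frames each, scored over 10 hypothetical activity classes.
model = LRCNSketch(num_classes=10)
logits = model(torch.randn(2, 8, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 10])
```

For the captioning and video description tasks mentioned in the abstract, the same visual features would instead condition a recurrent language model that emits words step by step; the sketch above shows only the sequence-classification variant.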

Cited by 2,395 publications (2,864 citation statements)
References 51 publications
“…In many other areas of computer vision, such as image classification, object detection, segmentation, or activity recognition, machine learning has allowed vision algorithms to train from offline data and learn about the world [5,23,13,25,9,28]. In each of these cases, the performance of the algorithm improves as it iterates through the training set of images.…”
Section: Introduction (mentioning)
confidence: 99%
“…Table 1: Mean accuracy (%) and standard deviation across 10 splits on the SUN397 dataset. Xiao et al. [11]: 20.9, 28.1, 38.0; Su and Jurie [10]: 35.6 (0.4); Donahue et al. [3]: 40.9 (0.…”
Section: Methods (mentioning)
confidence: 99%
“…dataset (Young et al 2014), a popular benchmark for caption generation and retrieval that has been used, among others, by Chen and Zitnick (2015); Donahue et al (2015); Fang et al (2015); Gong et al (2014b); Karpathy et al (2014); Karpathy and Fei-Fei (2015); Kiros et al (2014); Klein et al (2014); Lebret et al (2015); Mao et al (2015); Vinyals et al (2015); Xu et al (2015). Flickr30k contains 31,783 images focusing mainly on people and animals, and 158,915 English captions (five per image).…”
Section: Fig. (mentioning)
confidence: 99%
“…As mentioned in the Introduction, the most common image-language understanding task in the literature is automatic image captioning (Chen and Zitnick 2015; Donahue et al. 2015; Fang et al. 2015; Farhadi et al. 2010; Hodosh et al. 2013; Karpathy and Fei-Fei 2015; Kiros et al. 2014; Klein et al. 2014; Lev et al. 2016; Kulkarni et al. 2011; Lebret et al. 2015; Ma et al. 2015; Mao et al. 2015; Ordonez et al. 2011; Vinyals et al. 2015; Yao et al. 2010). Of most importance to us are the methods attempting to associate local regions in an image with words or phrases in the captions, as they would likely benefit the most from our annotations.…”
Section: Grounded Language Understanding (mentioning)
confidence: 99%