2014
DOI: 10.21236/ada623249

Long-term Recurrent Convolutional Networks for Visual Recognition and Description

Abstract: Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent are effective for tasks involving sequences, visual and otherwise. We describe a class of recurrent convolutional architectures which is end-to-end trainable and suitable for large-scale visual understanding tasks, and demonstrate the value of these models for activity recognition, image captioning, and video description. In contrast to previous models wh…
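The abstract describes an architecture that pairs a per-frame convolutional encoder with a recurrent model over time. As a rough illustration only, here is a minimal sketch of that idea in PyTorch, assuming a recent torchvision is available; the ResNet-18 backbone, hidden size, and class count are placeholder choices for this sketch, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class LRCNSketch(nn.Module):
    """Illustrative CNN-per-frame + LSTM-over-time model (activity-recognition variant)."""

    def __init__(self, num_classes: int, hidden_size: int = 256):
        super().__init__()
        # Per-frame visual encoder: a small ResNet with its classifier removed.
        backbone = models.resnet18(weights=None)
        self.feature_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()
        self.cnn = backbone
        # Recurrent model over the sequence of per-frame features.
        self.lstm = nn.LSTM(self.feature_dim, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, time, channels, height, width)
        b, t, c, h, w = clips.shape
        feats = self.cnn(clips.reshape(b * t, c, h, w)).reshape(b, t, -1)
        outputs, _ = self.lstm(feats)
        # Score activities from the final time step.
        return self.classifier(outputs[:, -1])

# Example: 2 clips of 8 frames each, scored over 10 hypothetical activity classes.
model = LRCNSketch(num_classes=10)
logits = model(torch.randn(2, 8, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 10])
```

For the captioning and video description tasks mentioned in the abstract, the same visual features would instead condition a recurrent language model that emits words step by step; the sketch above shows only the sequence-classification variant.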

Cited by 2,395 publications (2,864 citation statements)
References 51 publications
“…In many other areas of computer vision, such as image classification, object detection, segmentation, or activity recognition, machine learning has allowed vision algorithms to train from offline data and learn about the world [5,23,13,25,9,28]. In each of these cases, the performance of the algorithm improves as it iterates through the training set of images.…”
Section: Introduction (mentioning)
confidence: 99%
“…Table 1: Mean accuracy (%) and standard deviation across 10 splits on the SUN397 dataset. Xiao et al. [11]: 20.9, 28.1, 38.0; Su and Jurie [10]: 35.6 (0.4); Donahue et al. [3]: 40.9 (0.…”
Section: Methods (mentioning)
confidence: 99%
“…dataset (Young et al 2014), a popular benchmark for caption generation and retrieval that has been used, among others, by Chen and Zitnick (2015); Donahue et al (2015); Fang et al (2015); Gong et al (2014b); Karpathy et al (2014); Karpathy and Fei-Fei (2015); Kiros et al (2014); Klein et al (2014); Lebret et al (2015); Mao et al (2015); Vinyals et al (2015); Xu et al (2015). Flickr30k contains 31,783 images focusing mainly on people and animals, and 158,915 English captions (five per image).…”
Section: Fig. (mentioning)
confidence: 99%
“…As mentioned in the Introduction, the most common image-language understanding task in the literature is automatic image captioning (Chen and Zitnick 2015; Donahue et al. 2015; Fang et al. 2015; Farhadi et al. 2010; Hodosh et al. 2013; Karpathy and Fei-Fei 2015; Kiros et al. 2014; Klein et al. 2014; Lev et al. 2016; Kulkarni et al. 2011; Lebret et al. 2015; Ma et al. 2015; Mao et al. 2015; Ordonez et al. 2011; Vinyals et al. 2015; Yao et al. 2010). Of most importance to us are the methods attempting to associate local regions in an image with words or phrases in the captions, as they would likely benefit the most from our annotations.…”
Section: Grounded Language Understanding (mentioning)
confidence: 99%