YouTube Scale, Large Vocabulary Video Annotation

Morsillo, Nicholas; Mann, Gideon; Pal, Christopher

doi:10.1007/978-3-642-12900-1_14

Cited by 13 publications

(11 citation statements)

References 42 publications


“…every minute, 100 hours of video are uploaded to YouTube. However, if a video is poorly tagged, its utility is dramatically diminished [24]. Automatic video description generation has the potential to help improve indexing and search quality for online videos.…”

Section: Introduction (mentioning)

confidence: 99%

Describing Videos by Exploiting Temporal Structure

Yao, Torabi, Cho, et al. 2015

2015 IEEE International Conference on Computer Vision (ICCV)

Self Cite

Recent progress in using recurrent neural networks (RNNs) for image description has motivated the exploration of their application for video description. However, while images are static, working with videos requires modeling their dynamic temporal structure and then properly integrating that information into a natural language description. In this context, we propose an approach that successfully takes into account both the local and global temporal structure of videos to produce descriptions. First, our approach incorporates a spatio-temporal 3-D convolutional neural network (3-D CNN) representation of the short temporal dynamics. The 3-D CNN representation is trained on video action recognition tasks, so as to produce a representation that is tuned to human motion and behavior. Second, we propose a temporal attention mechanism that goes beyond local temporal modeling and learns to automatically select the most relevant temporal segments given the text-generating RNN. Our approach exceeds the current state of the art for both BLEU and METEOR metrics on the Youtube2Text dataset. We also present results on a new, larger and more challenging dataset of paired video and natural language descriptions.


“…First, LSH is more suitable for indexing global image descriptors (e.g., CNN) while vocabulary tree is built to index local image descriptors (e.g., SIFT). Second, LSH shows better search performance because of using the hashing technique while the vocabulary tree uses the recursive clustering to partition the space resulting in worse performance and higher inaccuracy especially when the tree becomes deeper [22].…”

Section: Spatial-visual Search (mentioning)

confidence: 99%

Hybrid Indexes for Spatial-Visual Search

Alfarrarjeh, Shahabi, Kim 2017

Proceedings of the Thematic Workshops of ACM Multimedia 2017

Due to the growth of geo-tagged images, recent web and mobile applications provide search capabilities for images that are similar to a given query image and simultaneously within a given geographical area. In this paper, we focus on designing index structures to expedite these spatial-visual searches. We start with baseline indexes that are straightforward extensions of the current popular spatial (R*-tree) and visual (LSH) index structures. Subsequently, we propose hybrid index structures that evaluate both spatial and visual features in tandem. The unique challenge of this type of query is that there are inaccuracies in both spatial and visual features. Therefore, different traversals of the index structures may produce different images as output, some of which are more relevant to the query than others. We compare our hybrid structures with a set of baseline indexes in both performance and result accuracy using three real-world datasets from Flickr, Google Street View, and GeoUGV.


“…Brezeale and Cook [17] surveyed text, video, and audio features for classifying videos into a predefined set of genres, e.g., "sports" or "comedy". Morsillo et al [94] presented a brief review that focused on efficient and scalable methods for annotating Web videos at various levels including objects, scenes, actions, and high-level events. Lavee et al [67] reviewed event modeling methods, mostly in the context of simple human activity analysis.…”

Section: Related Reviews (mentioning)

confidence: 99%

High-level event recognition in unconstrained videos

Jiang, Bhattacharya, Chang, et al. 2012

Int J Multimed Info Retr

The goal of high-level event recognition is to automatically detect complex high-level events in a given video sequence. This is a difficult task, especially when videos are captured under unconstrained conditions by nonprofessionals. Such videos depicting complex events have limited quality control and therefore may include severe camera motion, poor lighting, heavy background clutter, and occlusion. However, due to the fast-growing popularity of such videos, especially on the Web, solutions to this problem are in high demand and have attracted great interest from researchers. In this paper, we review current technologies for complex event recognition in unconstrained videos. While the existing solutions vary, we identify common key modules and provide detailed descriptions along with some insights for each of them, including extraction and representation of low-level features across different modalities, classification strategies, fusion techniques, etc. Publicly available benchmark datasets, performance metrics, and related research forums are also described. Finally, we discuss promising directions for future research.
