Proceedings of the 13th Annual ACM International Conference on Multimedia 2005
DOI: 10.1145/1101149.1101154

Joint visual-text modeling for automatic retrieval of multimedia documents

Abstract: In this paper we describe a novel approach for jointly modeling the text and the visual components of multimedia documents for the purpose of information retrieval (IR). We propose a novel framework where individual components are developed to model different relationships between documents and queries and then combined into a joint retrieval framework. In the state-of-the-art systems, a late combination between two independent systems, one analyzing just the text part of such documents, and the other analyzing…
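For contrast with the joint model the abstract proposes, the baseline it mentions, late combination of two independently built systems, typically reduces to a weighted fusion of per-document scores. A minimal sketch of that baseline, assuming a simple linear mix (the weighting scheme and all names are illustrative, not the paper's formulation):

```python
def late_fusion(text_scores, visual_scores, weight=0.5):
    """Late combination: linearly mix per-document relevance scores from
    two independent retrieval systems (text-only and visual-only)."""
    docs = set(text_scores) | set(visual_scores)
    return {
        d: weight * text_scores.get(d, 0.0)
           + (1 - weight) * visual_scores.get(d, 0.0)
        for d in docs
    }

# Toy usage: rank documents by the fused score.
fused = late_fusion({"doc1": 0.8, "doc2": 0.3}, {"doc1": 0.1, "doc2": 0.9})
print(sorted(fused, key=fused.get, reverse=True))  # ['doc2', 'doc1']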

Cited by 37 publications (37 citation statements) · References 18 publications · Citing publications span 2006–2014.
“…One might expect that these words would be predicted entirely by co-occurrence probabilities with their counterparts. We find that the combined model using both text and vision almost always produces the best results, an observation that is shared with other works in image and video annotation [21,19]. (Table 2: Top-20 annotation accuracy of selected pairs of words which naturally co-occur together.)…”

Section: Methods (supporting)
confidence: 67%
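The top-20 annotation accuracy mentioned in this excerpt is not defined on this page; a common reading is the fraction of test images whose ground-truth word appears among the model's 20 highest-scoring annotation words. A minimal sketch under that assumption (function and variable names are hypothetical):

```python
def top_k_accuracy(scores, ground_truth, k=20):
    """Fraction of images whose true word is ranked in the model's top k.

    scores: one dict per image, mapping candidate word -> model score.
    ground_truth: the correct annotation word for each image.
    """
    hits = 0
    for word_scores, truth in zip(scores, ground_truth):
        # Rank candidate words by descending model score.
        top_k = sorted(word_scores, key=word_scores.get, reverse=True)[:k]
        if truth in top_k:
            hits += 1
    return hits / len(ground_truth)

# Toy usage with k=1: first image is a hit, second is a miss -> 0.5.
scores = [{"tiger": 0.9, "grass": 0.4}, {"sky": 0.2, "plane": 0.7}]
print(top_k_accuracy(scores, ["tiger", "sky"], k=1))
```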
“…The first one is a standard Corel image set which contains 5000 images widely used for comparing results. The second one is the large-scale data set consisting of the entire TRECVID 2003 development dataset and feature set used by [11].…”

Section: Results (mentioning)
confidence: 99%
“…For comparison, the best published retrieval results we know of on the same data set are 0.31 (SML [6]) and 0.30 (NCRM [14]). On a TRECVID dataset [11] the corresponding numbers are 0.152 and 0.158 for the discrete MRF model and the NCRM model respectively. The discrete MRF takes 90 s for all queries while NCRM takes 6.8 hrs.…”

Section: Introduction (mentioning)
confidence: 99%
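The retrieval figures quoted here (0.31, 0.30, 0.152, 0.158) appear to be mean average precision values, the standard ranked-retrieval metric in TRECVID evaluations. As a reference point, a minimal mean-average-precision sketch (the names are illustrative, not from the cited systems):

```python
def average_precision(ranked_ids, relevant_ids):
    """AP for one query: mean of precision@rank at each relevant hit."""
    relevant = set(relevant_ids)
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy usage: relevant docs retrieved at ranks 1 and 3
# -> AP = (1/1 + 2/3) / 2 ~= 0.833.
print(mean_average_precision([(["d1", "d2", "d3"], ["d1", "d3"])]))
```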
“…Iyengar et al. [11] proposed a probabilistic model that relates words and image parts through an intermediate layer that captures common concepts. Models in this category usually rely on strong assumptions, e.g.…”

Section: Related Work (mentioning)
confidence: 99%
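The intermediate concept layer this excerpt describes can be read as a mixture model: words and image regions are treated as conditionally independent given a latent concept, so P(word | image) = Σ_c P(word | c) · P(c | image). A minimal sketch under that reading (all probabilities below are made-up toy values, not parameters from the paper):

```python
import numpy as np

# Toy latent-concept model: 2 concepts, 3 words. Assumes words and image
# regions are conditionally independent given the concept.
p_word_given_concept = np.array([
    [0.7, 0.2, 0.1],  # concept 0: mostly "tiger"
    [0.1, 0.1, 0.8],  # concept 1: mostly "sky"
])
words = ["tiger", "grass", "sky"]

def annotate(p_concept_given_image):
    """P(word | image) = sum over concepts c of P(word | c) * P(c | image)."""
    p_word = p_concept_given_image @ p_word_given_concept
    return dict(zip(words, p_word))

# An image whose visual features put 0.9 posterior mass on concept 0:
# "tiger" receives the highest word probability (0.64).
print(annotate(np.array([0.9, 0.1])))
```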