G. Iyengar scite author profile

We present a learning-based approach to the semantic indexing of multimedia content using cues derived from audio, visual, and text features. We approach the problem by developing a set of statistical models for a predefined lexicon. Novel concepts are then mapped in terms of the concepts in the lexicon. To achieve robust detection of concepts, we exploit features from multiple modalities, namely, audio, video, and text. Concept representations are modeled using Gaussian mixture models (GMM), hidden Markov models (HMM), and support vector machines (SVM). Models such as Bayesian networks and SVMs are used in a late-fusion approach to model concepts that are not explicitly modeled in terms of features. Our experiments indicate promise in the proposed classification and fusion methodologies: our proposed fusion scheme achieves more than 10% relative improvement over the best unimodal concept detector.
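As a rough illustration of the late-fusion idea in the abstract above, the sketch below combines per-modality detector confidences into one score and computes a relative improvement. It is not the authors' method: a weighted average is swapped in for the learned fusion models (Bayesian networks, SVMs) the abstract mentions, and the function names, modality keys, and all numbers are hypothetical.

```python
from typing import Dict


def late_fusion(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-modality detector confidences into a single score.

    Assumption: a weighted average stands in for the learned fusion
    models (Bayesian networks, SVMs) named in the abstract.
    """
    total = sum(weights[m] for m in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[m] * scores[m] for m in scores) / total


def relative_improvement(fused: float, best_unimodal: float) -> float:
    """Relative gain of the fused metric over the best single-modality metric."""
    return (fused - best_unimodal) / best_unimodal


# Hypothetical confidences for one concept in one video shot.
shot_scores = {"audio": 0.2, "visual": 0.6, "text": 0.4}
equal_weights = {"audio": 1.0, "visual": 1.0, "text": 1.0}
fused_score = late_fusion(shot_scores, equal_weights)
```

Under this reading, a "10% relative improvement" means that `relative_improvement(fused, best_unimodal)` evaluates to about 0.10 for whatever performance metric is being compared, for example a fused value of 0.55 against a best unimodal value of 0.50.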

Speaker Localisation Using Audio-Visual Synchrony: An Empirical Study

Nock

Iyengar

Neti

2003

Discovery and fusion of salient multimodal features toward news story segmentation

Hsu

Chang

Huang

et al. 2003

Joint visual-text modeling for automatic retrieval of multimedia documents

Iyengar

Duygulu

Feng

et al. 2005

In this paper we describe a novel approach for jointly modeling the text and the visual components of multimedia documents for the purpose of information retrieval (IR). We propose a novel framework where individual components are developed to model different relationships between documents and queries and then combined into a joint retrieval framework. In state-of-the-art systems, the norm is a late combination of two independent systems: one analyzing just the text part of such documents, and the other analyzing the visual part without leveraging any knowledge acquired in the text processing. Such systems rarely exceed the performance of any single modality (i.e., text or video) in information retrieval tasks. Our experiments indicate that allowing a rich interaction between the modalities results in significant improvement in performance over any single modality. We demonstrate these results using the TRECVID03 corpus, which comprises 120 hours of broadcast news videos. Our results demonstrate an improvement of over 14% in IR performance over the best reported text-only baseline and rank among the best results reported on this corpus.

A cascade image transform for speaker independent automatic speechreading

Potamianos

Verma

Neti

et al.

G. Iyengar

Semantic Indexing of Multimedia Content Using Visual, Audio, and Text Cues

Speaker Localisation Using Audio-Visual Synchrony: An Empirical Study

Discovery and fusion of salient multimodal features toward news story segmentation

Joint visual-text modeling for automatic retrieval of multimedia documents

A cascade image transform for speaker independent automatic speechreading
