Harriet J. Nock scite author profile

Semantic Indexing of Multimedia Content Using Visual, Audio, and Text Cues

We present a learning-based approach to the semantic indexing of multimedia content using cues derived from audio, visual, and text features. We approach the problem by developing a set of statistical models for a predefined lexicon. Novel concepts are then mapped in terms of the concepts in the lexicon. To achieve robust detection of concepts, we exploit features from multiple modalities, namely audio, video, and text. Concept representations are modeled using Gaussian mixture models (GMMs), hidden Markov models (HMMs), and support vector machines (SVMs). Models such as Bayesian networks and SVMs are used in a late-fusion approach to model concepts that are not explicitly modeled in terms of features. Our experiments indicate promise in the proposed classification and fusion methodologies: our proposed fusion scheme achieves more than 10% relative improvement over the best unimodal concept detector.

Stochastic pronunciation modelling from hand-labelled phonetic corpora

Riley

Byrne

Finke

et al. 1999

Speech Communication

Speaker Localisation Using Audio-Visual Synchrony: An Empirical Study

Nock

Iyengar

Neti

2003

Pronunciation modeling by sharing Gaussian densities across phonetic models

Saraçlar

Nock

Khudanpur

2000

Computer Speech & Language

Joint visual-text modeling for automatic retrieval of multimedia documents

Iyengar

Duygulu

Feng

et al. 2005

In this paper we describe a novel approach for jointly modeling the text and the visual components of multimedia documents for the purpose of information retrieval (IR). We propose a framework in which individual components are developed to model different relationships between documents and queries and are then combined into a joint retrieval framework. In state-of-the-art systems, the norm is a late combination of two independent systems: one analyzing just the text part of such documents, and the other analyzing the visual part without leveraging any knowledge acquired in the text processing. Such systems rarely exceed the performance of any single modality (i.e. text or video) in information retrieval tasks. Our experiments indicate that allowing a rich interaction between the modalities results in significant improvement in performance over any single modality. We demonstrate these results using the TRECVID03 corpus, which comprises 120 hours of broadcast news videos. Our results demonstrate over 14% improvement in IR performance over the best reported text-only baseline and rank amongst the best results reported on this corpus.

Audio-visual synchrony for detection of monologues in video archives

Iyengar

Nock

Neti

2003

In this paper we present our approach to detecting monologues in video shots. A monologue shot is defined as a shot containing a talking person in the video channel with the corresponding speech in the audio channel. Whilst motivated by the TREC 2002 Video Retrieval Track (VT02), the underlying approach of measuring synchrony between audio and video signals is also applicable to voice- and face-based biometrics, to assessing lip-synchronization quality in movie editing, and to speaker localization in video. Our approach is envisioned as a two-part scheme. We first detect the occurrence of speech and a face in a video shot. In shots containing both speech and a face, we distinguish monologue shots as those where the speech and facial movements are synchronized. To measure the synchrony between speech and facial movements we use a mutual-information-based measure. Experiments with the VT02 corpus indicate that using synchrony improves average precision by more than 50% relative compared to using face and speech information alone. Our synchrony-based monologue detector submission had the best average precision (in VT02) amongst the different submissions.
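
As a rough illustration of what a mutual-information-based synchrony measure can look like, the sketch below scores two per-frame feature sequences under a joint Gaussian assumption, where the mutual information reduces to a function of their correlation. The abstract does not say which estimator or features the authors used, so the Gaussian assumption, the function name, and the toy audio-energy and mouth-motion values are all assumptions made for illustration only.

```python
import math


def gaussian_mutual_information(x, y):
    """Mutual information (in nats) between two equal-length scalar
    sequences, ASSUMING they are jointly Gaussian:
    I(X; Y) = -0.5 * log(1 - rho^2), with rho the Pearson correlation.
    This estimator is an illustrative assumption, not the paper's method."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with >= 2 samples")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant sequence carries no information
    # Clip rho^2 just below 1 so the logarithm stays finite.
    rho_sq = min((sxy * sxy) / (sxx * syy), 1.0 - 1e-12)
    return -0.5 * math.log(1.0 - rho_sq)


# Hypothetical per-frame features for one shot (made-up values).
audio_energy = [0.1, 0.9, 0.2, 0.8, 0.1, 0.7]
mouth_motion_talking = [0.2, 0.8, 0.3, 0.9, 0.1, 0.6]    # moves with the audio
mouth_motion_unrelated = [0.5, 0.4, 0.6, 0.5, 0.4, 0.6]  # unrelated to the audio

synced_score = gaussian_mutual_information(audio_energy, mouth_motion_talking)
unsynced_score = gaussian_mutual_information(audio_energy, mouth_motion_unrelated)
print(synced_score, unsynced_score)
```

In a two-part scheme like the one described, a score of this kind would only be computed for shots where both speech and a face were already detected, and a higher score would suggest that the visible face is the one speaking.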

Discriminative model fusion for semantic concept detection and annotation in video

Iyengar

Nock

2003

In this paper we describe a general information fusion algorithm that can be used to incorporate multimodal cues in building user-defined semantic concept models. We compare this technique with a Bayesian network-based approach on a semantic concept detection task. Results indicate that this technique yields superior performance. We demonstrate this approach further by building classifiers of arbitrary concepts in a score space defined by a pre-deployed set of multimodal concepts. Results show that annotation of user-defined concepts, both in and outside the pre-deployed set, is competitive with our best video-only models on the TREC Video 2002 corpus.
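
The idea of classifying in a score space can be sketched as follows: each shot is represented by the vector of confidence scores from a set of existing concept detectors, and a new classifier is trained on those vectors. The abstract does not name the discriminative classifier, so this sketch swaps in a plain logistic regression trained by stochastic gradient descent; the three detectors, the scores, and the labels are hypothetical.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_score_space_classifier(score_vectors, labels, epochs=500, lr=0.5):
    """Fit a logistic regression on detector-score vectors.
    Logistic regression is a stand-in chosen for brevity; it is NOT
    claimed to be the classifier used in the paper."""
    dim = len(score_vectors[0])
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(score_vectors, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            grad = _sigmoid(z) - y  # gradient of the log-loss w.r.t. z
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


def fused_score(weights, bias, x):
    """Confidence for the new concept given one shot's score vector."""
    return _sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Hypothetical scores from three pre-deployed detectors per shot:
# [face, speech, outdoors]. Label 1 marks shots showing the new concept.
train_scores = [
    [0.9, 0.8, 0.1], [0.8, 0.9, 0.2], [0.7, 0.7, 0.3],
    [0.2, 0.1, 0.9], [0.3, 0.2, 0.8], [0.1, 0.3, 0.7],
]
train_labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_score_space_classifier(train_scores, train_labels)
likely_positive = fused_score(weights, bias, [0.85, 0.85, 0.15])
likely_negative = fused_score(weights, bias, [0.15, 0.20, 0.85])
print(likely_positive, likely_negative)
```

The design point this illustrates is that the new concept needs no feature-level model of its own: it is learned entirely from the outputs of detectors that already exist.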

Assessing face and speech consistency for monologue detection in video

Nock

Iyengar

Neti

2002
