Discovery and fusion of salient multimodal features toward news story segmentation

Hsu, Winston H.; Chang, Shih‐Fu; Huang, Chih‐Wei; Kennedy, Lyndon; Lin, Ching‐Yung; Iyengar, G.

doi:10.1117/12.533037

Cited by 27 publications

(48 citation statements)

References 11 publications


“…For example, field-to-studio shot transition is a salient story boundary cue. This is because many broadcast news programs follow a clear pattern: each news story starts with a studio shot and then moves to field shots [8]. An anchor face is another visual feature indicating a topic transition [9].…”

Section: Introduction (mentioning)

confidence: 99%

Broadcast News Story Segmentation Using Conditional Random Fields and Multimodal Features

Wang

Xie

et al. 2012

IEICE Trans. Inf. & Syst.


SUMMARY: In this paper, we propose the integration of multimodal features using conditional random fields (CRFs) for the segmentation of broadcast news stories. We study story boundary cues from lexical, audio and video modalities, where lexical features consist of lexical similarity, chain strength and overall cohesiveness; acoustic features involve pause duration, pitch, speaker change and audio event type; and visual features contain shot boundaries, anchor faces and news title captions. These features are extracted at a sequence of boundary candidate positions in the broadcast news. A linear-chain CRF is used to label each candidate as boundary or non-boundary based on the multimodal features. Important interlabel relations and contextual feature information are effectively captured by the sequential learning framework of CRFs. Story segmentation experiments show that the CRF approach outperforms other popular classifiers, including decision trees (DTs), Bayesian networks (BNs), naive Bayesian classifiers (NBs), multilayer perceptron (MLP), support vector machines (SVMs) and maximum entropy (ME) classifiers.


“…Discriminative models with specifically designed feature representation (e.g., bag of features [113], fisher scores [66]) and a similarity metric (e.g., EarthMover's Distance [116], string kernels [84]) have also shown good detection performance in domains like computational biology and text classification. Discriminative models have also been used to model video events such as story segmentation [63] or short-term events [40], [150], [154] with promising results.…”

Section: ) (mentioning)

confidence: 99%

“…A news story is “a segment of a news broadcast with a coherent news focus which contains at least two independent, declarative clauses” [139]. State-of-the-art detection algorithms achieve good segmentation results, with an F1 measure up to 0.74 [22], [27], [58], [63]. This is done by employing machine learning techniques such as SVM and HMM, along with judicious use of multimodal features such as shot length (production effect) or prosody in the anchor speech (content feature).…”

Section: ) Detecting Production Events (mentioning)

confidence: 99%

Event Mining in Multimedia Streams

2008


Events are real-world occurrences that unfold over space and time. Event mining from multimedia streams improves the access and reuse of large media collections, and it has been an active area of research with notable recent progress. This paper contains a survey on the problems and solutions in event mining, approached from three aspects: event description, event-modeling components, and current event mining systems. We present a general characterization of multimedia events, motivated by the maxim of five “W”s and one “H” for reporting real-world events in journalism: when, where, who, what, why, and how. We discuss the causes for semantic variability in real-world descriptions, including multilevel event semantics, implicit semantics facets, and the influence of context. We discuss five main aspects of an event detection system. These aspects are: the variants of tasks and event definitions that constrain system design, the media capture setup that collectively define the available data and necessary domain assumptions, the feature extraction step that converts the captured data into perceptually significant numeric or symbolic forms, statistical models that map the feature representations to richer semantic descriptions, and applications that use event metadata to help in different information-seeking tasks. We review current event-mining systems in detail, grouping them by the problem formulations and approaches. The review includes detection of events and actions in one or more continuous sequences, events in edited video streams, unsupervised event discovery, events in a collection of media objects, and a discussion on ongoing benchmark activities. These problems span a wide range of multimedia domains such as surveillance, meetings, broadcast news, sports, documentary, and films, as well as personal and online media collections. We conclude this survey with a brief outlook on open research directions.


“…1(a). Multi-modal fusion for unsupervised learning differs from those for supervised learning [8] in that neither labeled ground-truth nor class separability is available as the computational criteria for guiding the fusion model. Therefore we use the data likelihood in generative models as an alternative criterion to optimize the multilevel dynamic mixture model.…”

Section: Layered Dynamic Mixture Model (mentioning)

confidence: 99%

“…A story is defined [6] as a segment of a news broadcast with a coherent news focus which contains at least two independent, declarative clauses. Shot boundaries in news can be reliably detected with over 90% accuracy, while state-of-the-art audio-visual story segmentation has an F1 measure ∼ 75% [8].…”

Section: Processing Multi-modal Input (mentioning)

confidence: 99%

Layered Dynamic Mixture Model for Pattern Discovery in Asynchronous Multi-modal Streams

Xie

Kennedy

Chang

et al.

Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005.


We propose a layered dynamic mixture model for asynchronous multi-modal fusion for unsupervised pattern discovery in video. The lower layer of the model uses generative temporal structures such as a hierarchical hidden Markov model to convert the audio-visual streams into mid-level labels; it also models the correlations in text with probabilistic latent semantic analysis. The upper layer fuses the statistical evidence across diverse modalities with a flexible meta-mixture model that assumes loose temporal correspondence. Evaluation on a large news database shows that multi-modal clusters have better correspondence to news topics than audiovisual clusters alone; novel analysis techniques suggest that meaningful clusters occur when the prediction of salient features by the model concurs with those shown in the story clusters.
