Identification of story units in audio-visual sequences by joint audio and video processing

Saraceno, C.; Leonardi, Riccardo

doi:10.1109/icip.1998.723500

Cited by 32 publications

(22 citation statements)

References 5 publications


“…First, a formal definition of the event can be given as a model which is then used for detection. Such approaches are usually based on rules that can be either defined by a human expert (see, e.g., Saraceno and Leonardi (1998), Tovinkere and Qian (2001), Lienhart et al (1998), Zhong and Chang (2001)) or inferred from examples as in Perlovsky (1998). Alternately, machine learning techniques can be used to train a system from examples with the goal of deciding whether a video extract contains the event or not.…”

Section: Introduction (mentioning)

confidence: 99%

Classification-oriented structure learning in Bayesian networks for multimodal event detection in videos

Gravier, Demarty, Baghdadi, et al. 2012. Multimed Tools Appl

We investigate the use of structure learning in Bayesian networks for a complex multimodal task of action detection in soccer videos. We illustrate that classical score-oriented structure learning algorithms, such as the K2 one whose usefulness has been demonstrated on simple tasks, fail in providing a good network structure for classification tasks where many correlated observed variables are necessary to make a decision. We then compare several structure learning objective functions, which aim at finding out the structure that yields the best classification results, extending existing solutions in the literature. Experimental results on a comprehensive data set of 7 videos show that a discriminative objective function based on conditional likelihood yields the best results, while augmented approaches offer a good compromise between learning speed and classification accuracy.


“…Otherwise, these high-energy segments are checked for periodicity using an autocorrelation function. Since both voiced sounds and music may have significant peaks in their autocorrelation function, Zero Crossing Rate (ZCR) [4] of these signals is also measured. ZCR detects abrupt changes which should occur in speech signals due to existence of both voiced (low ZCR) and unvoiced (high ZCR) sounds.…”

Section: Audio Analysis (mentioning)

confidence: 99%

“…Audio analysis is achieved based on an algorithm in [4]. According to this algorithm, an audio track is segmented into four classes as, silence, speech, music and noise.…”

Section: Audio Analysis (mentioning)

confidence: 99%
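
The audio-analysis statements above name three measurements (signal energy, autocorrelation-based periodicity, and ZCR) and four classes (silence, speech, music, noise). The following is only a minimal sketch of how such features might be combined; it is not the algorithm of the cited paper. The function names, the decision order, the thresholds, and the lag range are all assumptions chosen for illustration.

```python
import math

def short_time_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def autocorr_peak(frame, min_lag, max_lag):
    """Largest normalized autocorrelation for lags in [min_lag, max_lag].

    Assumes len(frame) > max_lag. A value near 1 suggests a periodic signal.
    """
    energy = sum(x * x for x in frame)
    if energy == 0:
        return 0.0
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        best = max(best, r / energy)
    return best

def classify_segment(frames, energy_thr=0.01, period_thr=0.5,
                     zcr_spread_thr=0.1, min_lag=20, max_lag=100):
    """Label a segment (a list of frames) with one of four classes.

    All thresholds are hypothetical placeholders, not values from the paper.
    """
    mean_energy = sum(short_time_energy(f) for f in frames) / len(frames)
    if mean_energy < energy_thr:
        return "silence"
    # High-energy segments are checked for periodicity.
    periodicity = max(autocorr_peak(f, min_lag, max_lag) for f in frames)
    if periodicity < period_thr:
        return "noise"
    # Speech alternates between voiced (low ZCR) and unvoiced (high ZCR)
    # sounds, so a wide ZCR spread across frames is taken as a speech cue.
    zcrs = [zero_crossing_rate(f) for f in frames]
    return "speech" if max(zcrs) - min(zcrs) > zcr_spread_thr else "music"

# A steady 200 Hz tone sampled at 8 kHz: periodic, with a stable ZCR.
tone = [math.sin(2 * math.pi * 200 * n / 8000 + 0.3) for n in range(400)]
print(classify_segment([tone, tone, tone]))
```

Using the ZCR spread across frames, rather than a single ZCR value, follows the quoted observation that speech mixes low-ZCR and high-ZCR sounds while both speech and music can show autocorrelation peaks.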

“…While the deterministic methods usually cluster consecutive shots by utilizing appropriate measures [6,4], the probabilistic methods use hidden Markov models (HMM) to represent their states (i.e. scenes) in the content [7,8,9].…”

Section: Dialogue Scene Analysis (mentioning)

confidence: 99%

“…Although, it has obvious advantages over uni-modal analysis, the fusion of data with different nature is still a difficult problem to solve. Multi-modal approaches have been investigated for shot-boundary detection [2] speaker-dependent temporal indexing [3], story unit identification [4] and violent scene detection [5] while showing improvements over uni-modal counterparts.…”

Section: Multi-modal Scene Analysis (mentioning)

confidence: 99%


Automatic multi-modal dialogue scene indexing

Alatan. Proceedings 2001 International Conference on Image Processing (Cat. No.01CH37205)

An automatic algorithm for indexing dialogue scenes in multimedia content is proposed. The content is segmented into dialogue scenes using the state transitions of a hidden Markov model (HMM). Each shot is classified using both audio and visual information to determine the state/scene transitions for this model. Face detection and silence/speech/music classification are the basic tools which are utilized to index the scenes. While face information is extracted after applying some heuristics to skin-colored regions, audio analysis is achieved by examining signal energy, periodicity and zero crossing rate (ZCR) of the audio waveform. The simulation results show the possibility of automatically indexing the dialogues using the proposed algorithm.
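
The abstract above describes segmenting content by the state transitions of an HMM whose observations are per-shot audio-visual labels. One common way to recover a state sequence from such a model is Viterbi decoding, sketched below. The abstract does not say which decoding procedure was used, so this is an assumption, and the two states, the observation labels, and every probability are hypothetical values made up for illustration.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence, computed in the log domain."""
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
               for s in states}]
    backpointers = []
    for obs in observations[1:]:
        prev = scores[-1]
        cur, ptr = {}, {}
        for s in states:
            # Best predecessor state for reaching s at this step.
            best_prev = max(states, key=lambda p: prev[p] + math.log(trans_p[p][s]))
            cur[s] = (prev[best_prev] + math.log(trans_p[best_prev][s])
                      + math.log(emit_p[s][obs]))
            ptr[s] = best_prev
        scores.append(cur)
        backpointers.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path

# Hypothetical model: each shot is labeled by face presence plus audio class.
states = ["dialogue", "other"]
start_p = {"dialogue": 0.5, "other": 0.5}
trans_p = {"dialogue": {"dialogue": 0.8, "other": 0.2},
           "other": {"dialogue": 0.2, "other": 0.8}}
emit_p = {"dialogue": {"face_speech": 0.7, "noface_speech": 0.2, "noface_music": 0.1},
          "other": {"face_speech": 0.1, "noface_speech": 0.2, "noface_music": 0.7}}

shots = ["face_speech", "face_speech", "noface_speech",
         "face_speech", "noface_music", "noface_music"]
scene_labels = viterbi(shots, states, start_p, trans_p, emit_p)
print(scene_labels)
```

Scene boundaries would then fall wherever consecutive labels in `scene_labels` differ.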
