Detection of Documentary Scene Changes by Audio-Visual Fusion

Velivelli, Atulya; Ngo, Chong‐Wah; Huang, Thomas S.

doi:10.1007/3-540-45113-7_23

Cited by 10 publications

(6 citation statements)

References 9 publications


“…In the cited work, a Bayesian approach to determine biases of experts has been used to guide the fusion of audio and video streams. Fusion of audio and video modalities has been found to be useful for detection of documentary scene changes [26]. In [29], fusion of text and video has been proposed for story segmentation in news video.…”

Section: Multimedia Fusion

mentioning

confidence: 99%

Semantics reinforcement and fusion learning for multimedia streams

Joshi

Naphade

Natsev

2007

Proceedings of the 6th ACM International Conference on Image and Video Retrieval


Fusion of multimedia streams for enhanced performance is a critical problem for retrieval. However, fusion performance tends to easily overfit the hillclimb set used to learn fusion rules. In this paper, we perform fusion learning for multimedia streams using a greedy performance driven algorithm. In our fusion learning paradigm, fused output is a linear combination of multiple classifiers or ranked streams. The algorithm is inspired from Ensemble Learning [2] but takes that idea further for improving generalization capability. A key application of our fusion learning algorithm, described in this work, is semantics reinforcement using an ensemble of classifiers built using the same training dataset but groundtruth corresponding to different concepts. We expect that classifiers built for semantically close concepts should reinforce each other's performance and fusion learning is an excellent post-classification way to reinforce semantics and performance. Fusion learning experiments have been performed on TRECVID 2005 test set. Experiments using the well established retrieval effectiveness measure of mean average precision reveal that our proposed algorithm improves over the best classifier (oracle) by 3.8%. We also present and discuss some interesting and intuitive semantic reinforcement trends observed during fusion learning.
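The abstract above describes fused output as a linear combination of classifier or ranked streams, learned greedily against a retrieval measure. The following is only a minimal sketch of that general idea, assuming a plain greedy forward selection with replacement scored by average precision on a held-out set; it is not the authors' algorithm, and the names `average_precision`, `greedy_fusion`, `streams`, and `max_rounds` are hypothetical.

```python
from typing import Dict, List, Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Average precision of the ranking induced by `scores` (higher ranks first)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx]:
            hits += 1
            total += hits / rank  # precision at this relevant item
    return total / hits if hits else 0.0


def greedy_fusion(streams: Dict[str, List[float]],
                  labels: List[int],
                  max_rounds: int = 10) -> Dict[str, float]:
    """Greedily add streams (with replacement) while average precision improves.

    Returns normalized linear-combination weights, one per stream.
    """
    counts = {name: 0 for name in streams}
    fused = [0.0] * len(labels)   # running sum of the selected streams
    best_ap = -1.0
    for _ in range(max_rounds):
        best_name = None
        for name, scores in streams.items():
            candidate = [f + s for f, s in zip(fused, scores)]
            ap = average_precision(candidate, labels)
            if ap > best_ap:      # strict improvement only, so the loop terminates
                best_ap, best_name = ap, name
        if best_name is None:     # no stream improves the fused ranking
            break
        counts[best_name] += 1
        fused = [f + s for f, s in zip(fused, streams[best_name])]
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: c / total for name, c in counts.items()}


# Toy data (invented for illustration): two streams scoring four items.
labels = [1, 0, 1, 0]
streams = {"a": [0.75, 0.5, 0.0, 0.25], "b": [0.25, 0.0, 1.0, 0.5]}
weights = greedy_fusion(streams, labels)
```

Selecting against a held-out set rather than the training set is one simple guard against the overfitting the abstract warns about; the paper's own generalization strategy may differ.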



“…This approach treats the features as $m$ modalities, with features in the $t$-th modality ($1 \cdots m$). Most work in image and video retrieval analysis (e.g., [2,13,26,28,31]) employs this approach. For example, the QBIC system [13] supported image queries based on combining distances from the color and texture modalities.…”

mentioning

confidence: 99%

“…For example, the QBIC system [13] supported image queries based on combining distances from the color and texture modalities. Velivelli et al [31] separated video features into audio and visual modalities. IBM video analysis [2] also regarded each media track (visual, audio, textual, etc.)…”

mentioning

confidence: 99%

Optimal multimodal fusion for multimedia data analysis

Chang

et al. 2004

Proceedings of the 12th Annual ACM International Conference on Multimedia



Considerable research has been devoted to utilizing multimodal features for better understanding multimedia data. However, two core research issues have not yet been adequately addressed. First, given a set of features extracted from multiple media sources (e.g., extracted from the visual, audio, and caption track of videos), how do we determine the best modalities? Second, once a set of modalities has been identified, how do we best fuse them to map to semantics? In this paper, we propose a two-step approach. The first step finds statistically independent modalities from raw features. In the second step, we use super-kernel fusion to determine the optimal combination of individual modalities. We carefully analyze the tradeoffs between three design factors that affect fusion performance: modality independence, curse of dimensionality, and fusion-model complexity. Through analytical and empirical studies, we demonstrate that our two-step approach, which achieves a careful balance of the three design factors, can improve class-prediction accuracy over traditional techniques.


“…The scene is defined in different ways in the literature, since it is a subjective concept. Velivelli et al. (2003), for example, define a scene as a collection of shots that are temporally unified or that take place in the same location.…”

Section: Video Structure

unclassified

“…Velivelli et al. (2003), for example, define a scene as a collection of shots that are temporally unified or that take place in the same location. However, according to Choi and Lee (2010), the currently accepted definition of a scene is: a set of shots that portrays a single idea, theme, or concept, without constraints of time or space.…”

Section: Video Structure

unclassified

Detecção de cenas em segmentos semanticamente complexos

Lopes


I dedicate this work to my parents and to my fiancée, who have always supported and helped me at every moment.

Acknowledgments: I thank God first of all, for having enlightened me throughout the development of this work, giving me the patience and inspiration needed to complete it. I also thank my advisor for the endless advice and the always pertinent guidance. I thank the professors of the courses I took during the master's program, who certainly contributed to this research. I thank my colleagues and friends in the research laboratory, who always supported me and gave me strength in moments of discouragement. I thank CNPq for financial support, process no. 134245/2011-3. I thank FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo) for financial support, process no. 2011/05238-0. "The opinions, hypotheses, and conclusions or recommendations expressed in this material are the responsibility of the author(s) and do not necessarily reflect the views of FAPESP."

Resumo: Several areas of Computing (Content Personalization and Adaptation, Information Retrieval, among others) benefit from segmenting video into smaller units of information. The literature presents several methods and techniques whose goal is to identify these units. One limitation is that such techniques do not address scene detection in semantically complex segments, defined as stretches of video that present more than one subject or theme and whose latent semantics can hardly be determined using a single medium. These segments are highly relevant, since they occur in several video domains, such as movies, newscasts, and even commercials. This Master's dissertation proposes a video segmentation technique capable of identifying scenes in semantically complex segments. To do so, it uses the latent semantics obtained with Bag of Visual Words to group the segments of a video. The grouping is multimodal: the visual and audio features of each video are analyzed, and the results are combined through a late fusion strategy. The work demonstrates the technical feasibility of recognizing scenes in semantically complex segments.

The literature reports many techniques and methods whose goal is to identify these units. One limitation of these techniques is that they do not handle scene detection in semantically complex segments, which are defined as video snippets that present more than one subject or theme and whose latent semantics can hardly be determined using only one medium. Those segments are very relevant, since they are present in multiple video domains such as movies, news, and even television commercials. This Master's dissertation proposes a video scene segmentation technique able to detect scenes in semantically complex segments. In order to achieve this goal, it uses latent semantics extracted by the Bag of Visual Words to group a video's segments. This grouping process is based on multimodalit...
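The dissertation abstract above names late fusion as the strategy for combining visual and audio results. As a rough illustration only, and assuming each modality has already produced a boundary score in [0, 1] for every candidate shot transition, a weighted-average late fusion could look like the sketch below. The function name `late_fusion_boundaries`, the weight `w_visual`, and the `threshold` are hypothetical choices, not taken from the dissertation.

```python
from typing import List


def late_fusion_boundaries(visual: List[float],
                           audio: List[float],
                           w_visual: float = 0.5,
                           threshold: float = 0.5) -> List[int]:
    """Fuse per-modality boundary scores and return indices flagged as scene changes.

    Each modality is scored independently first; only the scores are combined,
    which is what makes this a late (decision-level) fusion.
    """
    if len(visual) != len(audio):
        raise ValueError("both modalities must score the same candidate transitions")
    w_audio = 1.0 - w_visual
    fused = [w_visual * v + w_audio * a for v, a in zip(visual, audio)]
    return [i for i, score in enumerate(fused) if score >= threshold]


# Toy scores (invented for illustration) for four candidate shot transitions.
visual_scores = [0.25, 1.0, 0.5, 0.0]
audio_scores = [0.25, 0.5, 1.0, 0.5]
boundaries = late_fusion_boundaries(visual_scores, audio_scores)
```

With equal weights, transitions 1 and 2 are flagged because strong evidence in one modality compensates for weaker evidence in the other, while transitions 0 and 3 stay below the threshold.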
