2010 IEEE International Workshop on Multimedia Signal Processing (MMSP 2010)
DOI: 10.1109/mmsp.2010.5662015

Unsupervised detection of multimodal clusters in edited recordings

Cited by 6 publications (12 citation statements). References 19 publications.

“…Note that this paper differs from [11] in two major ways. First, we propose a method that does not need an initial partitioning of the audio and visual data.…”
Section: Introduction (mentioning)
confidence: 90%
“…This idea is introduced in [12], where the single most consistent pair of clusters is selected according to heuristics on the pattern of occurrence of structurally relevant events. A rather similar philosophy is used in [11] to select pairs of segments, assuming each segment is labeled. The key difference is that [11] uses a unique segmentation in each modality, with cluster labels attached to segments, rather than a nested hierarchy of clusters.…”
Section: Overview (mentioning)
confidence: 99%
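
The pair-selection idea described in that statement can be illustrated with a minimal sketch: score every (audio cluster, visual cluster) pair by how often their event occurrences coincide, and keep the most consistent pair. The cluster names, the time tolerance, and the scoring rule below are illustrative assumptions, not the exact heuristics of [12].

```python
# Minimal sketch of cross-modal cluster-pair selection by co-occurrence.
# Names, tolerance, and the scoring rule are illustrative assumptions,
# not the exact heuristics of [12].

def cooccurrence_score(audio_times, visual_times, tol=0.5):
    """Fraction of audio event times matched by some visual event time
    within `tol` seconds (a simple stand-in consistency measure)."""
    if not audio_times:
        return 0.0
    hits = sum(any(abs(a - v) <= tol for v in visual_times)
               for a in audio_times)
    return hits / len(audio_times)

def most_consistent_pair(audio_clusters, visual_clusters):
    """Return the (audio_id, visual_id) pair whose occurrences coincide
    most often; each dict maps a cluster id to a list of event times."""
    best_pair, best_score = None, -1.0
    for a_id, a_times in audio_clusters.items():
        for v_id, v_times in visual_clusters.items():
            score = cooccurrence_score(a_times, v_times)
            if score > best_score:
                best_pair, best_score = (a_id, v_id), score
    return best_pair, best_score

if __name__ == "__main__":
    audio = {"jingle": [10.0, 55.2, 120.4], "speech": [5.0, 30.0]}
    visual = {"logo": [10.1, 55.0, 120.6], "anchor": [5.5, 31.0, 80.0]}
    print(most_consistent_pair(audio, visual))  # (('jingle', 'logo'), 1.0)
```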
“…However, the association between speech and face can introduce many ambiguities in the case of multi-face shots, as shown in the first image. Related work: earlier work on AV person diarization performs audio and video clustering separately in a first step and associates the clusters in a second step [2,3,4]. The simplest clue for associating faces and speakers is their temporal co-occurrence.…”
Section: Introduction (mentioning)
confidence: 99%
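
The temporal co-occurrence clue mentioned in this last statement can be sketched as follows: represent each speaker (audio) cluster and each face (visual) cluster as a list of time segments, accumulate pairwise overlap durations, and greedily pair each speaker with the face it co-occurs with most. The data layout and the greedy assignment are assumptions for illustration, not the exact method of [2,3,4].

```python
# Minimal sketch of speaker-face association by temporal co-occurrence.
# The segment representation and greedy assignment are illustrative
# assumptions, not the exact method of [2,3,4].

def overlap(seg_a, seg_b):
    """Overlap duration (seconds) between two (start, end) segments."""
    return max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))

def cooccurrence_matrix(speakers, faces):
    """Total overlap between every speaker cluster and face cluster;
    each cluster is a list of (start, end) segments."""
    return {
        (s_id, f_id): sum(overlap(s, f) for s in s_segs for f in f_segs)
        for s_id, s_segs in speakers.items()
        for f_id, f_segs in faces.items()
    }

def associate(speakers, faces):
    """Greedily pair each speaker with its most co-occurring face;
    ambiguous in multi-face shots, as the statement above notes."""
    m = cooccurrence_matrix(speakers, faces)
    return {s_id: max(faces, key=lambda f_id: m[(s_id, f_id)])
            for s_id in speakers}

if __name__ == "__main__":
    speakers = {"spk1": [(0, 10), (20, 30)], "spk2": [(10, 20)]}
    faces = {"faceA": [(0, 12), (19, 31)], "faceB": [(9, 21)]}
    print(associate(speakers, faces))  # {'spk1': 'faceA', 'spk2': 'faceB'}
```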