Proceedings of the 24th ACM International Conference on Multimedia 2016
DOI: 10.1145/2964284.2967211

Learning Multimodal Temporal Representation for Dubbing Detection in Broadcast Media

Abstract: Person discovery in the absence of prior identity knowledge requires accurate association of visual and auditory cues. In broadcast data, multimodal analysis faces additional challenges due to narrated voices over muted scenes or dubbing in different languages. To address these challenges, we define and analyze the problem of dubbing detection in broadcast data, which has not been explored before. We propose a method to represent the temporal relationship between the auditory and visual streams. This method co…

Cited by 16 publications (18 citation statements) · References 28 publications (27 reference statements)
“…For each database, the Train set was used to train PCA matrix, which was then applied to all combined features for all samples in the database. Canonical-correlation analysis (CCA) is also sometimes [2], [3] used to harmonize features of two modalities prior to the dimensionality reduction, but, in our experiments, we found this technique to have little effect on the results (about 1% reduction in error) and, therefore, do not report it in this paper.…”
Section: Processing Features
confidence: 82%
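The excerpt above describes fitting a PCA projection on the Train split only, then applying it to all samples, with CCA as an optional step to harmonize the two modalities first. A minimal NumPy sketch of both steps (function names and the regularization term are illustrative, not taken from the cited work):

```python
import numpy as np

def pca_fit(X_train, k):
    """Fit PCA on the Train set only; returns (mean, d x k projection)."""
    mu = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    return mu, Vt[:k].T  # project new data with (X - mu) @ P

def cca_fit(X, Y, k, reg=1e-6):
    """Harmonize two modalities: projections maximizing cross-correlation."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = len(X)
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])  # regularized covariances
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    def inv_sqrt(C):  # C^(-1/2) via eigendecomposition (C is symmetric PSD)
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, _, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    return Kx @ U[:, :k], Ky @ Vt[:k].T  # (Wx, Wy) canonical projections
```

Projecting each modality with its returned matrix yields maximally correlated coordinates, which can then be concatenated before the PCA step; per the excerpt, this gave only about a 1% error reduction in the authors' experiments.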
“…As per the latest related work [3], [4], [5], [6], we also use 13 MFCC features with their delta, double-delta derivatives [11], and energy (40 coefficients in total) to characterize speech in audio. MFCCs are computed from a power spectrum (power of magnitude of 512-sized FFT) on 20ms-long windows with 10ms overlap.…”
Section: B. Audio Features
confidence: 99%
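The audio front end described above (13 MFCCs with deltas, double-deltas, and energy, giving 40 coefficients per frame, computed from the power spectrum of a 512-point FFT on 20 ms windows with a 10 ms hop) can be sketched in plain NumPy. This is a minimal illustrative implementation, not the cited work's exact code; the function name and mel-filterbank size are assumptions:

```python
import numpy as np

def mfcc_features(signal, sr=16000, n_fft=512, win_ms=20, hop_ms=10,
                  n_mels=26, n_mfcc=13):
    # Frame the signal into 20 ms windows advanced by 10 ms.
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop:i * hop + win] for i in range(n_frames)])
    frames = frames * np.hamming(win)
    # Power of the magnitude of a 512-point FFT.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular mel filterbank spanning 0 .. sr/2.
    hz_to_mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel_to_hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # DCT-II of the log mel energies keeps the first 13 cepstral coefficients.
    k = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (k + 0.5)[None, :] * np.arange(n_mfcc)[:, None])
    mfcc = log_mel @ dct.T
    # Delta, double-delta, and log frame energy: 13*3 + 1 = 40 dims per frame.
    delta = np.gradient(mfcc, axis=0)
    ddelta = np.gradient(delta, axis=0)
    energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)[:, None]
    return np.hstack([mfcc, delta, ddelta, energy])  # shape (n_frames, 40)
```

One second of 16 kHz audio yields 99 such 40-dimensional frames, matching the 20 ms window / 10 ms hop framing described in the excerpt.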