Proceedings of the 24th ACM International Conference on Multimedia 2016
DOI: 10.1145/2964284.2967211

Learning Multimodal Temporal Representation for Dubbing Detection in Broadcast Media

Abstract: Person discovery in the absence of prior identity knowledge requires accurate association of visual and auditory cues. In broadcast data, multimodal analysis faces additional challenges due to narrated voices over muted scenes or dubbing in different languages. To address these challenges, we define and analyze the problem of dubbing detection in broadcast data, which has not been explored before. We propose a method to represent the temporal relationship between the auditory and visual streams. This method co…

Cited by 16 publications (18 citation statements) · References 28 publications (27 reference statements)
“…For each database, the Train set was used to train PCA matrix, which was then applied to all combined features for all samples in the database. Canonical-correlation analysis (CCA) is also sometimes [2], [3] used to harmonize features of two modalities prior to the dimensionality reduction, but, in our experiments, we found this technique to have little effect on the results (about 1% reduction in error) and, therefore, do not report it in this paper.…”
Section: Processing Features
confidence: 82%
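The excerpt above describes fitting a PCA projection on the Train split only, then applying it to all samples, with CCA as an optional step to harmonize the two modalities first. A minimal NumPy sketch of both steps (function names and the regularization term are illustrative, not taken from the cited work):

```python
import numpy as np

def pca_fit(X_train, k):
    """Fit PCA on the Train set only; returns (mean, d x k projection)."""
    mu = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    return mu, Vt[:k].T  # project new data with (X - mu) @ P

def cca_fit(X, Y, k, reg=1e-6):
    """Harmonize two modalities: projections maximizing cross-correlation."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = len(X)
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])  # regularized covariances
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    def inv_sqrt(C):  # C^(-1/2) via eigendecomposition (C is symmetric PSD)
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, _, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    return Kx @ U[:, :k], Ky @ Vt[:k].T  # (Wx, Wy) canonical projections
```

Projecting each modality with its returned matrix yields maximally correlated coordinates, which can then be concatenated before the PCA step; per the excerpt, this gave only about a 1% error reduction in the authors' experiments.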
“…As per the latest related work [3], [4], [5], [6], we also use 13 MFCC features with their delta, double-delta derivatives [11], and energy (40 coefficients in total) to characterize speech in audio. MFCCs are computed from a power spectrum (power of magnitude of 512-sized FFT) on 20ms-long windows with 10ms overlap.…”
Section: B. Audio Features
confidence: 99%
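The audio front end described above (13 MFCCs with deltas, double-deltas, and energy, giving 40 coefficients per frame, computed from the power spectrum of a 512-point FFT on 20 ms windows with a 10 ms hop) can be sketched in plain NumPy. This is a minimal illustrative implementation, not the cited work's exact code; the function name and mel-filterbank size are assumptions:

```python
import numpy as np

def mfcc_features(signal, sr=16000, n_fft=512, win_ms=20, hop_ms=10,
                  n_mels=26, n_mfcc=13):
    # Frame the signal into 20 ms windows advanced by 10 ms.
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop:i * hop + win] for i in range(n_frames)])
    frames = frames * np.hamming(win)
    # Power of the magnitude of a 512-point FFT.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular mel filterbank spanning 0 .. sr/2.
    hz_to_mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel_to_hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # DCT-II of the log mel energies keeps the first 13 cepstral coefficients.
    k = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (k + 0.5)[None, :] * np.arange(n_mfcc)[:, None])
    mfcc = log_mel @ dct.T
    # Delta, double-delta, and log frame energy: 13*3 + 1 = 40 dims per frame.
    delta = np.gradient(mfcc, axis=0)
    ddelta = np.gradient(delta, axis=0)
    energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)[:, None]
    return np.hstack([mfcc, delta, ddelta, energy])  # shape (n_frames, 40)
```

One second of 16 kHz audio yields 99 such 40-dimensional frames, matching the 20 ms window / 10 ms hop framing described in the excerpt.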