ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8682275

Cross-language Speech Dependent Lip-synchronization

Abstract: Understanding videos of people speaking across international borders is hard, as audiences from different demographics do not understand the language. Such speech videos are often supplemented with language subtitles. However, these hamper the viewing experience, as attention is divided between the subtitles and the video. Simple audio dubbing in a different language makes the video appear unnatural due to unsynchronized lip motion. In this paper, we propose a system for automated cross-language lip synchronization for re-dubbed videos. Our mod…
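The abstract describes a two-stage idea: obtain target-language speech, then regenerate lip motion to match it. Below is a minimal sketch of that pipeline; every function is a hypothetical placeholder standing in for the real components (a dubbing/TTS stage and an audio-conditioned lip-sync generator), not the authors' implementation.

```python
# Hypothetical sketch of a cross-language re-dubbing pipeline, as outlined in
# the abstract. The stage functions are placeholders, not the paper's code.
from typing import List, Tuple
import numpy as np

def get_dubbed_audio(source_audio: np.ndarray, target_language: str) -> np.ndarray:
    """Placeholder: a human dub, or speech translation followed by TTS."""
    raise NotImplementedError("hypothetical dubbing stage")

def lip_sync(frames: List[np.ndarray], dubbed_audio: np.ndarray) -> List[np.ndarray]:
    """Placeholder: regenerate the mouth region of each frame, conditioned on
    the dubbed audio, so lip motion matches the new speech."""
    raise NotImplementedError("hypothetical lip-sync stage")

def redub(frames: List[np.ndarray], source_audio: np.ndarray,
          target_language: str = "hi") -> Tuple[List[np.ndarray], np.ndarray]:
    """Dub the audio, then re-synchronize the lips to the dubbed track."""
    dubbed = get_dubbed_audio(source_audio, target_language)
    return lip_sync(frames, dubbed), dubbed
```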

Cited by 7 publications (4 citation statements). References 19 publications (20 reference statements).
“…As reported before in [12], a large amount of online educational content is present in English in the form of video lectures. They are often aided with subtitles in foreign languages.…”
Section: Educational Videos
confidence: 69%
“…Automated dubbing A common approach to automated dubbing is to generate or modify the video frames to match a given clip of audio speech [2,34,35,36,37,38,39,40,41]. This wide and active area of research uses approaches that vary from conditional video generation, to retrieval, to 3D models.…”
Section: Datasets
confidence: 99%
“…Lip synchronization Generating talking mouth videos by conditioning on audio [14,33,34,44] is more applicable to tackling audiovisual dubbing. Further literature uses similar models conditioned on text [45], videos of other speakers [46], and facial landmarks [47]. Recent approaches improve the quality and sharpness of the generated clips using GANs [45,34,44].…”
Section: Related Work
confidence: 99%
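Models of this kind condition frame generation on short audio windows aligned to each video frame. As an illustration only (not any cited paper's code), here is a self-contained sketch of that alignment step, assuming a 25 fps video and a spectrogram computed with a 10 ms hop:

```python
import numpy as np

def audio_windows_per_frame(spec: np.ndarray, fps: float = 25.0,
                            hop_s: float = 0.01, window: int = 16) -> np.ndarray:
    """spec: (T, n_mels) spectrogram, one row per hop_s seconds.
    Returns a (n_frames, window, n_mels) array: one fixed-size audio window
    per video frame, zero-padded where the clip boundary cuts it short."""
    hops_per_frame = (1.0 / fps) / hop_s        # e.g. 4 hops per frame at 25 fps
    n_frames = int(len(spec) / hops_per_frame)
    half = window // 2
    out = np.zeros((n_frames, window, spec.shape[1]), dtype=spec.dtype)
    for f in range(n_frames):
        center = int(round(f * hops_per_frame))  # hop index aligned to frame f
        lo, hi = max(0, center - half), min(len(spec), center + half)
        out[f, : hi - lo] = spec[lo:hi]          # remainder stays zero-padded
    return out
```

Each window then serves as the audio conditioning input for generating the mouth region of its corresponding frame.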
“…The vast majority of the literature focuses on face generation for English content only. Only recent efforts [47,34], which came out contemporaneously with our work, have attempted to tackle the problem of audiovisual dubbing from English to Hindi. In our work, we aim for a more systematic study of the multilingual scenario.…”
Section: Multi-lingual AV Translation
confidence: 99%