2022
DOI: 10.1109/access.2022.3231137
Pose-Aware Speech Driven Facial Landmark Animation Pipeline for Automated Dubbing

Cited by 5 publications (8 citation statements)
References 39 publications
“…For a detailed illustration, refer to Fig 4a and Eqn 4. This mechanism is similar to the approaches used in other studies such as [64] and [65], where models were conditioned with information from the previous frame to guarantee temporal consistency in the context of video generation. Essentially, maintaining consistent colors throughout a video sequence becomes more achievable when the model can "remember" the colors from the previous frame.…”
Section: Temporal Consistency
confidence: 99%
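
As a concrete illustration of the previous-frame conditioning described above, the following is a minimal PyTorch sketch; the module, its channel sizes, and the rollout helper are assumptions for illustration only, not the architecture of [64] or [65].

import torch
import torch.nn as nn

class PrevFrameConditionedGenerator(nn.Module):
    """Generates frame t conditioned on frame t-1 so appearance (e.g. colour) persists."""
    def __init__(self, in_channels=3, hidden=64):
        super().__init__()
        # Input = current conditioning features + previous generated frame,
        # concatenated along the channel axis (hence 2 * in_channels).
        self.net = nn.Sequential(
            nn.Conv2d(2 * in_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, in_channels, kernel_size=3, padding=1),
        )

    def forward(self, cond_frame, prev_frame):
        return self.net(torch.cat([cond_frame, prev_frame], dim=1))

def generate_sequence(model, cond_frames, first_frame):
    """Autoregressive rollout: each output is fed back as the next prev_frame."""
    frames, prev = [], first_frame
    for cond in cond_frames:           # each cond: (B, C, H, W)
        prev = model(cond, prev)       # the model "remembers" the previous frame
        frames.append(prev)
    return torch.stack(frames, dim=1)  # (B, T, C, H, W)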
“…Talking Head refers to a computer-generated virtual character, avatar, or animated still image that can speak and emulate human-like facial expressions, lip movements, and emotions based on an audio track. This technology combines artificial intelligence, computer vision, and natural language processing to create a lifelike digital representation of a person that can deliver speech realistically and expressively [9]–[13].…”
Section: Talking Head
confidence: 99%
“…By leveraging deep learning algorithms, the system analyzes audio input, interprets the speech content, and generates synchronized lip movements and facial expressions that closely align with the spoken words [9]–[11]. This process creates a seamless lip-syncing effect, making it appear that the virtual character speaks the dialogue naturally and convincingly.…”
Section: Talking Head
confidence: 99%
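
A minimal sketch of the audio-to-lip-movement mapping this statement describes, assuming per-frame audio features (e.g. MFCCs) and 68-point facial landmarks; the dimensions, names, and recurrent architecture are illustrative assumptions, not the models of [9]–[11].

import torch
import torch.nn as nn

class SpeechToLandmarks(nn.Module):
    def __init__(self, audio_dim=28, hidden=128, n_landmarks=68):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim, hidden, num_layers=2, batch_first=True)
        # Predict (x, y) displacements for each landmark relative to a neutral face.
        self.head = nn.Linear(hidden, n_landmarks * 2)

    def forward(self, audio_feats):                 # (B, T, audio_dim)
        h, _ = self.rnn(audio_feats)
        offsets = self.head(h)                      # (B, T, n_landmarks * 2)
        return offsets.view(*offsets.shape[:2], -1, 2)

# Usage: neutral-pose landmarks plus predicted offsets give the animated sequence.
model = SpeechToLandmarks()
mfcc = torch.randn(1, 100, 28)                      # 100 frames of audio features
neutral = torch.zeros(1, 1, 68, 2)                  # neutral-pose landmarks
animated = neutral + model(mfcc)                    # (1, 100, 68, 2)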
“…Within the context of talking head generation and video editing, a number of recent works have explored using diffusion models. Specifically, Stypułkowski et al. (2023), Shen et al. (2023), and Bigioi et al. (2023) were among the first to explore their use for end-to-end talking head generation and audio-driven video editing. All three methods follow a similar auto-regressive, frame-based approach in which the previously generated frame is fed back into the model along with the audio signal and a reference identity frame to generate the next frame in the sequence.…”
Section: Diffusion-based Generation
confidence: 99%
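
To make the quoted auto-regressive scheme concrete, a simplified sampling-loop sketch follows, assuming a plain DDPM-style ancestral sampler and a user-supplied denoiser(x, t, prev_frame, audio_feat, identity_frame) network; the schedule handling and all names are assumptions, not the exact samplers of the cited papers.

import torch

@torch.no_grad()
def sample_next_frame(denoiser, prev_frame, audio_feat, identity_frame, betas):
    """One reverse-diffusion rollout producing a single video frame."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn_like(identity_frame)            # start from pure noise
    for t in reversed(range(len(betas))):
        # The network predicts the noise, given all three conditioning signals.
        eps = denoiser(x, t, prev_frame, audio_feat, identity_frame)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:                                   # add noise except at the final step
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

@torch.no_grad()
def generate_video(denoiser, identity_frame, audio_feats, betas):
    """Feed each generated frame back in as conditioning for the next one."""
    frames, prev = [], identity_frame               # bootstrap with the identity frame
    for audio_feat in audio_feats:                  # one audio window per video frame
        prev = sample_next_frame(denoiser, prev, audio_feat, identity_frame, betas)
        frames.append(prev)
    return torch.stack(frames, dim=1)               # (B, T, C, H, W)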