Self-supervision has recently shown great promise for learning visual and auditory speech representations from unlabelled data. In this work, we propose BRAVEn, an extension of the recent RAVEn method, which learns speech representations entirely from raw audio-visual data. Our modifications to RAVEn enable BRAVEn to achieve state-of-the-art results among self-supervised methods in various settings. Moreover, we observe favourable scaling behaviour when increasing the amount of unlabelled data well beyond that used in other self-supervised works. In particular, we achieve 20.0% / 1.7% word error rate for VSR / ASR on the LRS3 test set, with only 30 hours of labelled data and no external ASR models. Our results suggest that readily available unlabelled audio-visual data can largely replace costly transcribed data. Code is available at https://github.com/ahaliassos/raven.
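To make the cross-modal objective concrete, below is a minimal sketch, assuming a PyTorch-style setup, of the student-teacher prediction scheme that RAVEn-style methods build on: each modality's student regresses features produced by a momentum (EMA) teacher of the other modality. All module sizes, feature dimensions, and the exact student-teacher pairing here are illustrative assumptions, not the authors' implementation; input masking and the Transformer backbones are omitted.

```python
# Sketch of cross-modal target prediction in the spirit of RAVEn/BRAVEn.
# Students are trained to predict features from the *other* modality's
# momentum teacher; teachers receive no gradients.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Stand-in encoder; the real models use Transformer backbones."""
    def __init__(self, in_dim, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):  # x: (batch, time, in_dim)
        return self.net(x)

audio_student = Encoder(in_dim=80)    # e.g. log-mel frames (assumed)
video_student = Encoder(in_dim=512)   # e.g. lip-region features (assumed)
audio_teacher = copy.deepcopy(audio_student).requires_grad_(False)  # EMA copies
video_teacher = copy.deepcopy(video_student).requires_grad_(False)

audio = torch.randn(4, 100, 80)
video = torch.randn(4, 100, 512)

with torch.no_grad():
    a_tgt = audio_teacher(audio)  # targets come from the teachers
    v_tgt = video_teacher(video)

# Each student predicts the opposite modality's teacher features.
loss = F.mse_loss(audio_student(audio), v_tgt) + F.mse_loss(video_student(video), a_tgt)
loss.backward()

# After each step, the teachers would be updated as an exponential moving
# average of the student weights (momentum update), omitted here for brevity.
```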
The large amount of audiovisual content shared online today has drawn substantial attention to the prospect of audiovisual self-supervised learning. Recent works have focused on each of these modalities separately, while others have attempted to model both simultaneously in a cross-modal fashion. However, comparatively little attention has been given to leveraging one modality as a training objective for learning from the other. In this work, we propose Learning visual speech Representations from Audio via self-supervision (LiRA). Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech. We find that this pre-trained model can be leveraged for word-level and sentence-level lip-reading through feature extraction and fine-tuning experiments. We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild (LRW) dataset and achieves state-of-the-art performance on Lip Reading Sentences 2 (LRS2) using only a fraction of the total labelled data.
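The pretext task amounts to a regression problem: a visual model is trained so that its frame-level outputs match acoustic features computed from the audio track of the same unlabelled clip. The following is a minimal sketch under that reading; the frontend, feature dimensions, frame rate, and loss choice are assumptions standing in for the ResNet+Conformer and the actual acoustic targets.

```python
# Hedged sketch of a LiRA-style pretext task (not the authors' code):
# regress audio-derived acoustic features from visual speech features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualFrontend(nn.Module):
    """Stand-in for the ResNet+Conformer described in the abstract."""
    def __init__(self, in_dim=2048, out_dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, out_dim))

    def forward(self, lip_feats):  # (batch, frames, in_dim)
        return self.proj(lip_feats)

model = VisualFrontend(out_dim=256)
lip_feats = torch.randn(8, 25, 2048)        # one second of video at 25 fps (assumed)
acoustic_targets = torch.randn(8, 25, 256)  # precomputed acoustic features, frame-aligned

pred = model(lip_feats)
loss = F.l1_loss(pred, acoustic_targets)    # regression onto the audio-derived targets
loss.backward()

# After pre-training, the frontend is reused for lip-reading either frozen
# (feature extraction) or fine-tuned end-to-end, as the abstract describes.
```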
Music composition is a complex field that is difficult to automate because the computational definition of what is good or aesthetically pleasing is vague and subjective. Many neural network-based methods have been applied in the past, but they lack consistency and, in most cases, their outputs fail to impress. The most common issues include excessive repetition and a lack of style and structure, which are hallmarks of artificial compositions. In this project, we build on a model created by Magenta, the RL Tuner, extending it to emulate a specific musical genre: the Galician Xota. To do this, we design a new rule set that compositions should follow to adhere to this style. We then implement these rules as reward functions, which are used to train the Deep Q-Network that generates the pieces. After extensive experimentation, we arrive at an implementation of our rule set that effectively enforces each rule on the generated compositions, and we outline a solid research methodology for future researchers looking to use this architecture. Finally, we propose promising future work regarding further applications of this model and improvements to the experimental procedure.
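To illustrate how a style rule becomes a reward function in this kind of setup, here is a small sketch. It is not the Magenta or project code: the specific rule (penalizing excessive note repetition, one of the issues the abstract names), the blending weight, and all function names are assumptions; in the RL Tuner framework, such rule terms are combined with a note model's log-probability to form the reward the DQN is trained on.

```python
# Illustrative reward shaping for an RL Tuner-style setup (hypothetical code).
from collections import Counter

def repetition_penalty(melody, window=8, max_repeats=3):
    """Penalize the latest note if it already appears too often in the recent window."""
    if not melody:
        return 0.0
    recent = Counter(melody[-window:])
    return -1.0 if recent[melody[-1]] > max_repeats else 0.0

def total_reward(melody, model_log_prob, rule_weight=0.5):
    """Blend the note model's log-probability with the rule-based term,
    as the RL Tuner framework does; the weighting here is an assumption."""
    return model_log_prob + rule_weight * repetition_penalty(melody)

# Example: the melody ends with a fourth consecutive repeat of MIDI note 60,
# so the rule term fires and lowers the reward from -1.2 to -1.7.
print(total_reward([60, 60, 60, 60], model_log_prob=-1.2))
```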