Joon Son Chung scite author profile

Most existing datasets for speaker identification contain samples obtained under quite constrained conditions, and are usually hand-annotated, hence limited in size. The goal of this paper is to generate a large scale text-independent speaker identification dataset collected 'in the wild'.We make two contributions. First, we propose a fully automated pipeline based on computer vision techniques to create the dataset from open-source media. Our pipeline involves obtaining videos from YouTube; performing active speaker verification using a two-stream synchronization Convolutional Neural Network (CNN), and confirming the identity of the speaker using CNN based facial recognition. We use this pipeline to curate VoxCeleb which contains hundreds of thousands of 'real world' utterances for over 1,000 celebrities.Our second contribution is to apply and compare various state of the art speaker identification techniques on our dataset to establish baseline performance. We show that a CNN based architecture obtains the best performance for both identification and verification.

show abstract

VoxCeleb2: Deep Speaker Recognition

Chung

2018

View full text Add to dashboard Cite

Lip Reading Sentences in the Wild

Chung¹,

Senior²,

Vinyals³

et al. 2017

589

534

View full text Add to dashboard Cite

The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem -unconstrained natural language sentences, and in the wild videos.Our key contributions are: (1) a 'Watch, Listen, Attend and Spell' (WLAS) network that learns to transcribe videos of mouth motion to characters; (2) a curriculum learning strategy to accelerate training and to reduce overfitting; (3) a 'Lip Reading Sentences' (LRS) dataset for visual speech recognition, consisting of over 100,000 natural sentences from British television.The WLAS model trained on the LRS dataset surpasses the performance of all previous work on standard lip reading benchmark datasets, often by a significant margin. This lip reading performance beats a professional lip reader on videos from BBC television, and we also demonstrate that visual information helps to improve speech recognition performance even when the audio is available.

show abstract

Deep Audio-Visual Speech Recognition

Afouras

Chung

Senior

et al. 2022

IEEE Trans. Pattern Anal. Mach. Intell.

402

433

View full text Add to dashboard Cite

The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem -unconstrained natural language sentences, and in the wild videos.Our key contributions are: (1) we compare two models for lip reading, one using a CTC loss, and the other using a sequence-to-sequence loss. Both models are built on top of the transformer self-attention architecture; (2) we investigate to what extent lip reading is complementary to audio speech recognition, especially when the audio signal is noisy; (3) we introduce and publicly release a new dataset for audio-visual speech recognition, LRS2-BBC, consisting of thousands of natural sentences from British television. The models that we train surpass the performance of all previous work on a lip reading benchmark dataset by a significant margin.

show abstract

Lip Reading in the Wild

2017

View full text Add to dashboard Cite

The Conversation: Deep Audio-Visual Speech Enhancement

Afouras¹,

Chung²,

Zisserman³

2018

282

326

View full text Add to dashboard Cite

Our goal is to isolate individual speakers from multi-talker simultaneous speech in videos. Existing works in this area have focussed on trying to separate utterances from known speakers in controlled environments. In this paper, we propose a deep audio-visual speech enhancement network that is able to separate a speaker's voice given lip regions in the corresponding video, by predicting both the magnitude and the phase of the target signal. The method is applicable to speakers unheard and unseen during training, and for unconstrained environments. We demonstrate strong quantitative and qualitative results, isolating extremely challenging real-world examples.

show abstract

Out of Time: Automated Lip Sync in the Wild

2017

View full text Add to dashboard Cite

The goal of this work is to determine the audio-video synchronisation between mouth motion and speech in a video. We propose a two-stream ConvNet architecture that enables the mapping between the sound and the mouth images to be trained end-to-end from unlabelled data. The trained network is used to determine the lip-sync error in a video. We apply the network to two further tasks: active speaker detection and lip reading. On both tasks we set a new state-of-the-art on standard benchmark datasets.

show abstract

Voxceleb: Large-scale speaker verification in the wild

Nagrani

Chung

Xie

et al. 2020

Computer Speech & Language

381

268

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Joon Son Chung

VoxCeleb: A Large-Scale Speaker Identification Dataset

VoxCeleb2: Deep Speaker Recognition

Lip Reading Sentences in the Wild

Deep Audio-Visual Speech Recognition

Lip Reading in the Wild

The Conversation: Deep Audio-Visual Speech Enhancement

Out of Time: Automated Lip Sync in the Wild

Voxceleb: Large-scale speaker verification in the wild

Contact Info

Product

Resources

About