This work presents a scalable solution to open-vocabulary visual speech recognition. To achieve this, we constructed the largest existing visual speech recognition dataset, consisting of pairs of text and video clips of faces speaking (3,886 hours of video). In tandem, we designed and trained an integrated lipreading system, consisting of a video processing pipeline that maps raw video to stable videos of lips and sequences of phonemes, a scalable deep neural network that maps the lip videos to sequences of phoneme distributions, and a production-level speech decoder that outputs sequences of words. The proposed system achieves a word error rate (WER) of 40.9% as measured on a held-out set. In comparison, professional lipreaders achieve either 86.4% or 92.9% WER on the same dataset, even when given access to additional types of contextual information. Our approach significantly improves on previous lipreading approaches, including variants of LipNet and of Watch, Attend, and Spell (WAS), which achieve only 89.8% and 76.8% WER respectively.
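The WER figures quoted above are word-level edit distances normalized by reference length. As a minimal illustration (plain Python, not the authors' evaluation code), the metric can be computed with a standard Levenshtein dynamic program over words:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of
    reference words, via word-level Levenshtein edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, decoding "the cat sit on mat" against the reference "the cat sat on the mat" gives one substitution and one deletion over six reference words, i.e. a WER of about 33%.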
Classical face recognition techniques have been successful at operating under well-controlled conditions; however, they have difficulty robustly performing recognition in uncontrolled real-world scenarios where variations in pose, illumination, and expression are encountered. In this paper, we propose a new method for real-world unconstrained pose-invariant face recognition. We first construct a 3D model for each subject in our database using only a single 2D image by applying the 3D Generic Elastic Model (3D GEM) approach. These 3D models comprise an intermediate gallery database from which novel 2D pose views are synthesized for matching. Before matching, an initial estimate of the pose of the test query is obtained using a linear regression approach based on automatic facial landmark annotation. Each 3D model is subsequently rendered at different poses within a limited search space about the estimated pose, and the resulting images are matched against the test query. Finally, we compute the distances between the synthesized images and the test query using a simple normalized correlation matcher, demonstrating the effectiveness of our pose synthesis method on real-world data. We present convincing results on challenging data sets and video sequences demonstrating high recognition accuracy under controlled as well as unseen, uncontrolled real-world scenarios using a fast implementation.
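The "normalized correlation matcher" mentioned above is a standard similarity measure between two images treated as intensity vectors. A minimal sketch in plain Python (the function name and vector representation are illustrative, not taken from the paper):

```python
import math

def normalized_correlation(a, b):
    """Zero-mean normalized correlation between two equal-length
    intensity vectors (e.g. flattened images); result lies in [-1, 1],
    with 1 meaning a perfect linear match."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                    sum((y - mean_b) ** 2 for y in b))
    return num / den if den else 0.0
```

Because the measure is invariant to affine changes in intensity, a synthesized render and a query image can score highly even under moderate lighting differences, which is what makes it a reasonable baseline matcher here.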
Recent studies in biometrics have shown that the periocular region of the face is sufficiently discriminative for robust recognition, and particularly effective in certain scenarios such as extreme occlusions and illumination variations where traditional face recognition systems are unreliable. In this paper, we first propose a fully automatic, robust and fast graph-cut based eyebrow segmentation technique to extract the eyebrow shape from a given face image. We then propose an eyebrow shape-based identification system for periocular face recognition. Our experiments have been conducted over large datasets from the MBGC and AR databases and the resilience of the proposed approach has been evaluated under varying data conditions. The experimental results show that the proposed eyebrow segmentation achieves high accuracy with an F-Measure of 99.4% and the identification system achieves rates of 76.0% on the AR database and 85.0% on the MBGC database.
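The F-Measure used to score the segmentation is the harmonic mean of precision and recall; for a segmentation task these are typically computed from pixel-level true positives, false positives, and false negatives. A minimal sketch (the counts-based signature is an illustrative assumption, not the paper's evaluation code):

```python
def f_measure(tp, fp, fn):
    """F-Measure from pixel counts: harmonic mean of
    precision = tp/(tp+fp) and recall = tp/(tp+fn)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

An F-Measure of 99.4%, as reported above, requires both precision and recall to be very high simultaneously; the harmonic mean penalizes any imbalance between the two.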
Automatic face recognition performance has been steadily improving over years of research; however, it remains significantly affected by factors such as illumination, pose, expression, and resolution that can impact matching scores. The focus of this paper is the pose problem, which remains largely overlooked in most real-world applications. Specifically, we focus on one-to-one matching scenarios where a query face image of an arbitrary pose is matched against a set of gallery images. We propose a method that relies on two fundamental components: (a) a 3D modeling step to geometrically correct the viewpoint of the face, for which we extend a recent technique for efficient synthesis of 3D face models called the 3D Generic Elastic Model; and (b) a sparse feature extraction step using subspace modeling and ℓ1-minimization to induce pose-tolerance in coefficient space. This in turn enables the synthesis of an equivalent frontal-looking face, which can be used towards recognition. We show significant performance improvements in verification rates compared to commercial matchers, and also demonstrate the resilience of the proposed method with respect to degrading input quality. We find that the proposed technique is able to match non-frontal images to other non-frontal images of varying angles.
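The ℓ1-minimization step above seeks a sparse coefficient vector over a subspace dictionary. One standard way to solve such problems is iterative shrinkage-thresholding (ISTA); the sketch below is a generic ISTA solver in plain Python, shown only to make the ℓ1 machinery concrete, and is not the authors' solver or parameterization:

```python
def soft_threshold(x, t):
    """Proximal operator of the l1 norm: shrink each entry toward 0 by t."""
    return [max(abs(v) - t, 0.0) * (1.0 if v > 0 else -1.0) for v in x]

def ista(A, y, lam, step, iters=500):
    """Minimize 0.5*||Ax - y||^2 + lam*||x||_1 by iterative
    shrinkage-thresholding. A is a list of rows; step must satisfy
    step <= 1/L, where L is the largest eigenvalue of A^T A."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        # residual r = Ax - y
        r = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        # gradient of the smooth term: g = A^T r
        g = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        # gradient step followed by the l1 proximal (shrinkage) step
        x = soft_threshold([x[j] - step * g[j] for j in range(n)], step * lam)
    return x
```

For instance, with an identity dictionary the solution is simply the soft-thresholded observation: small coefficients are zeroed out, which is exactly the sparsity-inducing behavior the abstract relies on to obtain pose-tolerant coefficients.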