Automatic Visual Speech Recognition

Chitu, Alin; Rothkrantz, Léon J. M.

doi:10.5772/36466

Cited by 10 publications

(5 citation statements)

References 50 publications

“…This dataset and the solutions evaluated on it predated the deep learning revolution. Rothkrantz et al achieved the state-of-the-art result on the NDUTAVSC dataset with an accuracy of 84.27% [73]. This dataset is not used as a metric for many ALR models due to the fact that it is in Dutch and due to the lack of variation within the dataset.…”

Section: NDUTAVSC (mentioning)

confidence: 99%

Data-Driven Advancements in Lip Motion Analysis: A Review

Torrie, Sumsion, Lee et al. 2023

Electronics

This work reviews the dataset-driven advancements that have occurred in the area of lip motion analysis, particularly visual lip-reading and visual lip motion authentication, in the deep learning era. We provide an analysis of datasets and their usage, creation, and associated challenges. Future research can utilize this work as a guide for selecting appropriate datasets and as a source of insights for creating new and innovative datasets. Large and varied datasets are vital to a successful deep learning system. There have been many incredible advancements made in these fields due to larger datasets. There are indications that even larger, more varied datasets would result in further improvement upon existing systems. We highlight the datasets that brought about the progression in lip-reading systems from digit- to word-level lip-reading, and then from word- to sentence-level lip-reading. Through an in-depth analysis of lip-reading system results, we show that datasets with large amounts of diversity increase results immensely. We then discuss the next step for lip-reading systems to move from sentence- to dialogue-level lip-reading and emphasize that new datasets are required to make this transition possible. We then explore lip motion authentication datasets. While lip motion authentication has been well researched, it is not very unified on a particular implementation, and there is no benchmark dataset to compare the various methods. As was seen in the lip-reading analysis, large, diverse datasets are required to evaluate the robustness and accuracy of new methods attempted by researchers. These large datasets have pushed the work in the visual lip-reading realm. Due to the lack of large, diverse, and publicly accessible datasets, visual lip motion authentication research has struggled to validate results and real-world applications. 
A new benchmark dataset is required to unify the studies in this area such that they can be compared to previous methods as well as validate new methods more effectively.

“…Audio information detects the acoustic waveform of a speaker, whereas visual information detects lip movements [1]. Despite the challenges such as auditory recognition in noisy environments, audiovisual speech recognition (AVSR) is widely investigated and is reported to exhibit excellent recognition capabilities [2][3][4][5][6]. AVSR is used in technologies such as Microsoft Azure, Google Assistant, and Amazon Alexa, which convert analog signals into digital formats by acoustically analyzing speech and automatically transcribing it into…”

Section: Introduction (mentioning)

confidence: 99%

Multimodal audiovisual speech recognition architecture using a three‐feature multi‐fusion method for noise‐robust systems

Jeon, Lee, Yeo et al. 2024

ETRI Journal

Exposure to varied noisy environments impairs the recognition performance of artificial intelligence‐based speech recognition technologies. Degraded‐performance services can be utilized as limited systems that assure good performance in certain environments, but impair the general quality of speech recognition services. This study introduces an audiovisual speech recognition (AVSR) model robust to various noise settings, mimicking human dialogue recognition elements. The model converts word embeddings and log‐Mel spectrograms into feature vectors for audio recognition. A dense spatial–temporal convolutional neural network model extracts features from log‐Mel spectrograms, transformed for visual‐based recognition. This approach exhibits improved aural and visual recognition capabilities. We assess the signal‐to‐noise ratio in nine synthesized noise environments, with the proposed model exhibiting lower average error rates. The error rate for the AVSR model using a three‐feature multi‐fusion method is 1.711%, compared to the general 3.939% rate. This model is applicable in noise‐affected environments owing to its enhanced stability and recognition rate.
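The abstract above reports results across synthesized noise environments at different signal-to-noise ratios, but it does not say how the noisy audio was produced. As a rough illustration only, one common approach is additive mixing at a target SNR. The sketch below assumes equal-length mono sample lists; `mix_at_snr` and `measured_snr_db` are hypothetical helper names, not taken from the paper.

```python
import math
from typing import List, Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Average power (mean of squared samples)."""
    return sum(v * v for v in samples) / len(samples)


def mix_at_snr(signal: Sequence[float], noise: Sequence[float],
               snr_db: float) -> List[float]:
    """Add `noise` to `signal`, scaled so the mixture has the target SNR in dB.

    Assumption: both inputs are equal-length, non-silent mono sample lists.
    """
    if len(signal) != len(noise):
        raise ValueError("signal and noise must have the same length")
    p_signal = mean_power(signal)
    p_noise = mean_power(noise)
    if p_signal == 0 or p_noise == 0:
        raise ValueError("signal and noise must both be non-silent")
    # SNR_dB = 10 * log10(P_signal / (scale^2 * P_noise)), solved for scale.
    scale = math.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(signal, noise)]


def measured_snr_db(signal: Sequence[float], noisy: Sequence[float]) -> float:
    """SNR of `noisy` relative to the clean `signal`, in dB."""
    residual = [y - s for s, y in zip(signal, noisy)]
    return 10 * math.log10(mean_power(signal) / mean_power(residual))
```

Lower `snr_db` values give a noisier mixture; passing the result back through `measured_snr_db` is a quick way to check that the scaling landed on the intended level.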

“…Vision plays a crucial role in speech understanding, and the importance of utilizing visual information to improve the performance and robustness of speech recognition has been demonstrated [2, 3, 4]. Although acoustic information is richer than visual information when speaking, most people rely on watching lip movements to fully understand speech [2]. Furthermore, people rely on visual information in noisy environments where receiving auditory information is challenging.…”

Section: Introduction (mentioning)

confidence: 99%

Lipreading Architecture Based on Multiple Convolutional Neural Networks for Sentence-Level Visual Speech Recognition

Jeon, Elsharkawy, Kim 2021

Sensors

In visual speech recognition (VSR), speech is transcribed using only visual information to interpret tongue and teeth movements. Recently, deep learning has shown outstanding performance in VSR, with accuracy exceeding that of lipreaders on benchmark datasets. However, several problems still exist when using VSR systems. A major challenge is the distinction of words with similar pronunciation, called homophones; these lead to word ambiguity. Another technical limitation of traditional VSR systems is that visual information does not provide sufficient data for learning words such as “a”, “an”, “eight”, and “bin” because their lengths are shorter than 0.02 s. This report proposes a novel lipreading architecture that combines three different convolutional neural networks (CNNs; a 3D CNN, a densely connected 3D CNN, and a multi-layer feature fusion 3D CNN), which are followed by a two-layer bi-directional gated recurrent unit. The entire network was trained using connectionist temporal classification. The results of the standard automatic speech recognition evaluation metrics show that the proposed architecture reduced the character and word error rates of the baseline model by 5.681% and 11.282%, respectively, for the unseen-speaker dataset. Our proposed architecture exhibits improved performance even when visual ambiguity arises, thereby increasing VSR reliability for practical applications.
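The character and word error rates mentioned in this abstract are conventionally defined as the edit distance between the reference and the recognized output, divided by the reference length. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function names are hypothetical, and the example sentences in the usage note are made up for illustration.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For instance, comparing the reference `"bin blue at f two now"` with the hypothesis `"bin blue at f too now"` gives one substituted word out of six. A visually confusable pair such as "two" and "too" costs a full word error but only a single character error, which is one reason both metrics are usually reported together.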
