2022
DOI: 10.3390/s22093597

End-to-End Sentence-Level Multi-View Lipreading Architecture with Spatial Attention Module Integrated Multiple CNNs and Cascaded Local Self-Attention-CTC

Abstract: Concomitant with the recent advances in deep learning, automatic speech recognition and visual speech recognition (VSR) have received considerable attention. However, although VSR systems must identify speech from both frontal and profile faces in real-world scenarios, most VSR studies have focused solely on frontal face pictures. To address this issue, we propose an end-to-end sentence-level multi-view VSR architecture for faces captured from four different perspectives (frontal, 30°, 45°, and 60°). The encod…
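The abstract mentions a spatial attention module integrated with multiple CNNs. The paper's exact module is not reproduced on this page, so the following is only a minimal sketch of a generic CBAM-style spatial attention block (kernel size, placement, and the class name SpatialAttention are assumptions, not the authors' design):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: re-weights each spatial location of a
    CNN feature map using channel-wise pooling. Illustrative only; the
    paper's own module may differ in kernel size and placement."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width) feature map from a CNN branch
        avg_pool = x.mean(dim=1, keepdim=True)        # (B, 1, H, W)
        max_pool, _ = x.max(dim=1, keepdim=True)      # (B, 1, H, W)
        attn = torch.sigmoid(self.conv(torch.cat([avg_pool, max_pool], dim=1)))
        return x * attn                               # re-weighted feature map
```

In a multi-view setup, one such block could be attached to each per-view CNN branch before the features are fused, but that placement is likewise an assumption based only on the title and truncated abstract.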

Cited by 8 publications (6 citation statements)
References 53 publications
“…The current state of the art is 98.31% [93]. The results on this dataset tend to be much higher than those collected in the wild due to the limitation of vocabulary (digit sequences and assigned phrases) as well as the constrained lab scenario the dataset was collected in.…”
Section: OuluVS2
confidence: 97%
“…Dataset            Year  Language  SOTA Accuracy  Speech Scenario
GRID [90]            2006  English   98.7% [91]     Structured sentences
GRID-Lombard [99]    2018  English   N/A            Structured sentences
OuluVS2 [92]         2015  English   98.31% [93]    Controlled sentences
MODALITY [94]        2017  English   54.00% [94]    Controlled sentences
LRS [45]             2017  English   49.8% [45]     TV interviews
MV-LRS [84]          2017  English   47.2% [84]     TV programs
LRS2-BBC [95]        2018  English   64.8% [96]     TV programs
LRS3-TED [97]        2018  English   63.7% [98]     Formal lectures
CMLR [101]           2019  Mandarin  67.52% [101]   TV programs
LSVSR [100]          2018  English   59.1% [100]    YouTube videos
YTDEV18 [102]        2019  English   N/A            YouTube videos
SynthVSR [103]       2023  English   N/A            Synthetic data…”
Section: Dataset
confidence: 99%
“…Additionally, following the "transition" structure shown in Figure S1A(b) (Table S1), a standard dropout layer was connected, and pixels were randomly dropped to prevent strong correlations in the feature maps of successive frames [39]. In addition, the spatial dropout layer connected to the "transition" structure, shown in Figure S2C, was effectively used to extract fine movement features with strong spatial correlation, such as the lips, teeth, and tongue [31][32][33][34][35][36][37][38][39][40][41][42][43]. Therefore, each layer of the proposed dense spatial-temporal CNN represents a nonlinear transformation H_l, and the output of layer l can be expressed as x_l = H_l([x_0, x_1, …, x_{l-1}]) (3), where x_0, x_1, …, x_{l-1} denote the volumes of the 3D features created in the previous layers and [···] denotes a concatenation operation.…”
Section: Visual Module
confidence: 99%
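The quoted passage describes a DenseNet-style dense spatial-temporal block: each layer applies H_l to the concatenation of all preceding feature volumes, with spatial (channel-wise) dropout to suit spatially correlated lip, teeth, and tongue features. Below is a minimal PyTorch sketch of that pattern, with hypothetical layer sizes and names, not the cited paper's exact configuration:

```python
import torch
import torch.nn as nn

class DenseLayer3D(nn.Module):
    """One dense layer: H_l applied to the concatenation [x_0, ..., x_{l-1}].
    nn.Dropout3d (spatial dropout) zeroes whole feature channels rather than
    individual voxels. Hypothetical sizes, not the paper's configuration."""
    def __init__(self, in_channels: int, growth_rate: int, p_drop: float = 0.2):
        super().__init__()
        self.h = nn.Sequential(
            nn.BatchNorm3d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(in_channels, growth_rate, kernel_size=3, padding=1),
            nn.Dropout3d(p_drop),
        )

    def forward(self, features: list[torch.Tensor]) -> torch.Tensor:
        # x_l = H_l([x_0, x_1, ..., x_{l-1}])
        return self.h(torch.cat(features, dim=1))

class DenseBlock3D(nn.Module):
    """Stacks dense layers; each sees all previously produced feature volumes."""
    def __init__(self, in_channels: int, growth_rate: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList(
            [DenseLayer3D(in_channels + i * growth_rate, growth_rate)
             for i in range(num_layers)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]                        # x_0
        for layer in self.layers:
            features.append(layer(features))  # append x_l
        return torch.cat(features, dim=1)
```

Here nn.Dropout3d plays the role of the spatial dropout layer described in the statement: dropping entire channels preserves the strong spatial correlation within each feature map, whereas standard per-pixel dropout would break it.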
“…Many studies initially centered on 2D fully convolutional networks [4,5]. But as hardware improved, 3D convolutions over 2D convolutions [6][7][8] or recurrent neural networks [9,10] quickly became an option for more effective use of temporal information. This concept has developed into a specific architecture consisting of two parts: the front end and the back end.…”
Section: Related Work
confidence: 99%
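The front-end/back-end split mentioned in this statement is the common VSR pattern: a convolutional (often 3D) front end extracts spatio-temporal features from mouth-region frames, and a sequence back end maps them to text. The sketch below illustrates that split under generic assumptions; the layer sizes, the GRU back end, and the CTC-style output are illustrative choices, not a specific cited architecture:

```python
import torch
import torch.nn as nn

class LipreadingModel(nn.Module):
    """Generic front-end/back-end VSR skeleton (illustrative only).
    Input: (batch, 1, frames, H, W) grayscale mouth crops.
    Output: per-frame log-probabilities over output symbols, CTC-style."""
    def __init__(self, num_classes: int, hidden: int = 256):
        super().__init__()
        # Front end: a 3D convolution captures short-range motion across frames.
        self.front_end = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 4, 4)),  # keep temporal length, shrink space
        )
        # Back end: a recurrent sequence model over per-frame feature vectors.
        self.back_end = nn.GRU(32 * 4 * 4, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        feats = self.front_end(video)                     # (B, C, T, 4, 4)
        b, c, t, h, w = feats.shape
        feats = feats.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        out, _ = self.back_end(feats)                     # (B, T, 2*hidden)
        return self.classifier(out).log_softmax(dim=-1)   # suitable for a CTC loss
```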