2021
DOI: 10.3390/s22010072

Lipreading Architecture Based on Multiple Convolutional Neural Networks for Sentence-Level Visual Speech Recognition

Abstract: In visual speech recognition (VSR), speech is transcribed using only visual information to interpret tongue and teeth movements. Recently, deep learning has shown outstanding performance in VSR, with accuracy exceeding that of lipreaders on benchmark datasets. However, several problems still exist when using VSR systems. A major challenge is the distinction of words with similar pronunciation, called homophones; these lead to word ambiguity. Another technical limitation of traditional VSR systems is that visua…

Cited by 20 publications (9 citation statements)
References 41 publications
“…Additionally, following the "transition" structure shown in Figure S1A(b) (Table S1), a standard dropout layer was connected, and pixels were randomly dropped to prevent strong correlation in feature maps between successive frames [39]. In addition, the spatial dropout layer connected to the "transition" structure, shown in Figure S2C, was effectively used to extract fine movement features such as lips, teeth, and tongue with strong spatial correlation [31][32][33][34][35][36][37][38][39][40][41][42][43]. Therefore, the proposed dense spatial-temporal CNN network comprises layers each representing a nonlinear transformation H_l, and the output of layer l can be expressed as x_l = H_l([x_0, x_1, …, x_{l-1}]) (3), where x_0, x_1, …, x_{l-1} denote the volumes of the 3D features created in the previous layers and [⋯] denotes a concatenation operation.…”
Section: Visual Module
confidence: 99%
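
To make the dense connectivity and spatial dropout described in this statement concrete, here is a minimal PyTorch sketch of one dense spatial-temporal block. It is an illustration under assumptions only: the growth rate, kernel sizes, dropout rate, and layer count are placeholders, not the configuration reported in the paper.

    # Minimal sketch of a dense spatial-temporal CNN block with spatial
    # dropout, following x_l = H_l([x_0, x_1, ..., x_{l-1}]).
    # Channel counts, growth rate, kernel sizes, and dropout rate are
    # illustrative assumptions, not the paper's actual configuration.
    import torch
    import torch.nn as nn

    class DenseSpatioTemporalBlock(nn.Module):
        def __init__(self, in_channels: int, growth_rate: int = 16, num_layers: int = 3):
            super().__init__()
            self.layers = nn.ModuleList()
            channels = in_channels
            for _ in range(num_layers):
                # H_l: BN -> ReLU -> 3D conv over (time, height, width)
                self.layers.append(nn.Sequential(
                    nn.BatchNorm3d(channels),
                    nn.ReLU(inplace=True),
                    nn.Conv3d(channels, growth_rate, kernel_size=3, padding=1),
                    # Spatial dropout: drops whole feature maps rather than
                    # single pixels, suited to spatially correlated features
                    # such as lip, teeth, and tongue movements.
                    nn.Dropout3d(p=0.2),
                ))
                channels += growth_rate  # later inputs grow by concatenation

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            features = [x]  # x_0
            for layer in self.layers:
                # x_l = H_l([x_0, x_1, ..., x_{l-1}])
                out = layer(torch.cat(features, dim=1))
                features.append(out)
            return torch.cat(features, dim=1)

    # Example: a batch of 2 clips, 32 channels, 16 frames, 24x24 spatial size
    x = torch.randn(2, 32, 16, 24, 24)
    block = DenseSpatioTemporalBlock(in_channels=32)
    print(block(x).shape)  # torch.Size([2, 80, 16, 24, 24]): 32 + 3 * 16 channels

Note that Dropout3d zeroes entire feature maps rather than individual activations, which is what distinguishes the spatial dropout in this passage from the standard dropout attached to the "transition" structure.
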
“…This proposed architecture [12] has been very effective: it achieved a notable 17.5% improvement in accuracy on the lip-reading datasets LRW [13] and LRW-1000 [14], and even now many state-of-the-art lip-reading solutions still use it as their meta-architecture [15][16][17][18]. Following the level of recognized speech, in this section we present related work in the current research area by dividing it into four sub-sections:…”
Section: Related Work
confidence: 99%
“…Therefore, the end-to-end deep learning architecture is the natural direction of scientific research, as opposed to the earlier methodology of manual feature extraction and classification for automatic lip-reading technology. In the last few years, researchers have used convolutional neural networks (CNNs) to focus on regions of interest, and these networks have also been quite successful at classifying images and detecting targets [8], [9].…”
Section: Article History
confidence: 99%