Deep Learning-Based Holistic Speaker Independent Visual Speech Recognition

Nemani, Praneeth; Krishna, Ghanta Sai; Ramisetty, Nikhil; Sai, B Digvijay Sri; Kumar, Santosh

doi:10.1109/tai.2022.3220190

Cited by 12 publications

(1 citation statement)

References 155 publications

“…The consistency of the results between accuracy and the F1 score is also not far away. For the original dataset, MediaPipe + LRCN with 3 CNN layers has superior results (87%) compared to Inception V3 (86.6%) [48], CNN (52.9%) [48], VGG-16+LSTM [47] (59%), 3D-CNN [51] (70.2), and 3D-CNN + LSTM [52] (85%).…”

mentioning

confidence: 99%

Indonesian Lip‐Reading Detection and Recognition Based on Lip Shape Using Face Mesh and Long‐Term Recurrent Convolutional Network

Aripin, Setiawan

2024

Applied Computational Intelligence and Soft Computing

Communication through speech can be hindered by environmental noise, prompting the need for alternative methods such as lip reading, which bypasses auditory challenges. However, the accurate interpretation of lip movements is impeded by the uniqueness of individual lip shapes, necessitating detailed analysis. In addition, the development of an Indonesian dataset addresses the lack of diversity in existing datasets, which are predominantly in English, fostering more inclusive research. This study proposes an enhanced lip-reading system trained using the long-term recurrent convolutional network (LRCN) considering eight different types of lip shapes. MediaPipe Face Mesh precisely detects lip landmarks, enabling the LRCN model to recognize Indonesian utterances. Experimental results demonstrate the effectiveness of the approach, with the LRCN model with three convolutional layers (LRCN-3Conv) achieving 95.42% accuracy for word test data and 95.63% for phrases, outperforming the convolutional long short-term memory (Conv-LSTM) method. Furthermore, on the original MIRACL-VC1 dataset, LRCN-3Conv achieved a best accuracy of 90.67% in the word-labeled class, exceeding the results of previous studies. The success is attributed to MediaPipe Face Mesh, which facilitates the accurate detection of the lip region. Leveraging advanced deep learning techniques and precise landmark detection, these findings promise improved communication accessibility for individuals facing auditory challenges.
