2019 4th International Conference on Information Technology (InCIT) 2019
DOI: 10.1109/incit.2019.8912049
Audio-Visual Speech Recognition System Using Recurrent Neural Network

Cited by 8 publications
(2 citation statements)
References 24 publications
“…In lip-reading studies, it has been noted that some research uses its own datasets, albeit a small number (Lu & Yan, 2020; Goh et al., 2019). Generally, studies are conducted on commonly used datasets (Petridis et al., 2020; Mesbah et al., 2019), known to be OuluVS2 (Anina et al., 2015), AvLetters (Matthews et al., 2002) and GRID.…”
Section: Datasets
confidence: 99%
“…The audio features are of many kinds. The three of them used in [18] are LPC, PLP, and MFCC. The study shows that MFCC has the highest accuracy, about 94.6%, for the Hindi language in a noiseless environment.…”
Section: Introduction
confidence: 99%