2017 International Joint Conference on Neural Networks (IJCNN)
DOI: 10.1109/ijcnn.2017.7965918

Audio visual speech recognition with multimodal recurrent neural networks

Cited by 60 publications (28 citation statements)
References 19 publications
“…The best recognition accuracy is 93.33% when using BiLSTM with early-integrated audio-visual features, an improvement over audio-only of up to 8.33%, which shows that our proposed model gives better recognition accuracy than that obtained in [40], which reports 87.7% for audio-visual recognition on the same dataset, as shown in Table 4. DCT is applied to the input image to extract the most important features, zigzag scanning then selects the main coefficients (minimizing the number of features), and these features are fed to the BiLSTM classifier to perform video-only speech recognition.…”
Section: Figure 8: AVLetters Confusion Matrix With Audio-Visual Features (supporting)
confidence: 51%
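
As an illustration of the pipeline this statement describes (2-D DCT feature extraction, zigzag coefficient selection, BiLSTM classification), here is a minimal Python sketch. It is not the cited authors' implementation: the 64x64 mouth-region frame size, the number of retained coefficients k=64, the hidden size, and the 26 output classes (AVLetters covers the letters A-Z) are illustrative assumptions.

    # Hypothetical sketch, not the paper's code: DCT + zigzag selection -> BiLSTM.
    import numpy as np
    import torch
    import torch.nn as nn
    from scipy.fftpack import dct

    def zigzag_indices(n):
        # (row, col) pairs of an n x n block in JPEG-style zigzag order,
        # so low-frequency DCT coefficients come first.
        return sorted(((r, c) for r in range(n) for c in range(n)),
                      key=lambda rc: (rc[0] + rc[1],
                                      rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

    def dct_zigzag_features(frame, k=64):
        # 2-D DCT of a square grayscale frame; keep the first k zigzag coefficients.
        coeffs = dct(dct(frame, axis=0, norm='ortho'), axis=1, norm='ortho')
        idx = zigzag_indices(frame.shape[0])[:k]
        return np.array([coeffs[r, c] for r, c in idx], dtype=np.float32)

    class BiLSTMClassifier(nn.Module):
        def __init__(self, in_dim=64, hidden=128, n_classes=26):
            super().__init__()
            self.lstm = nn.LSTM(in_dim, hidden, batch_first=True,
                                bidirectional=True)
            self.fc = nn.Linear(2 * hidden, n_classes)  # 2x for both directions

        def forward(self, x):              # x: (batch, time, in_dim)
            out, _ = self.lstm(x)
            return self.fc(out[:, -1])     # logits from the final time step

    # Usage on a dummy 30-frame clip of 64x64 mouth crops:
    frames = np.random.rand(30, 64, 64)
    feats = torch.from_numpy(
        np.stack([dct_zigzag_features(f) for f in frames]))[None]
    logits = BiLSTMClassifier()(feats)     # shape: (1, 26)

The zigzag ordering matters because the 2-D DCT concentrates most of the signal energy in the low-frequency corner of the block; taking the first k coefficients in zigzag order keeps that energy while discarding high-frequency detail.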
“…Denote the input layer, hidden layer and output layer at time t as X^{(t)}, h^{(t)} and o^{(t)}, respectively, where U, V, and W are the weight matrices of the input-to-hidden, hidden-to-output, and hidden-to-hidden connections, respectively [52]. In RNNs, the connections between nodes form a directed loop, so sequential events can be interpreted in relation to one another [44,53].…”
Section: Deep Belief Network (DBNs) (mentioning)
confidence: 99%
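
The update equations implied by this notation were lost in extraction; a standard Elman-style recurrence consistent with the quoted definitions is the following (the activation functions f and g and the bias terms b_h and b_o are assumptions, not given in the quote):

    h^{(t)} = f(U X^{(t)} + W h^{(t-1)} + b_h)
    o^{(t)} = g(V h^{(t)} + b_o)

Here f is typically tanh and g is typically a softmax over the output classes.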
“…There is increased interest in using neural networks for multi-modal fusion of auditory and visual signals to solve various speech-related problems. These include audio-visual speech recognition [Feng et al. 2017; Mroueh et al. 2015; Ngiam et al. 2011], predicting speech or text from silent video (lipreading) [Chung et al. 2016; Ephrat et al. 2017], and unsupervised learning of language from visual and speech signals [Harwath et al. 2016]. These methods leverage natural synchrony between simultaneously recorded visual and auditory signals.…”
Section: Related Work (mentioning)
confidence: 99%