2017 International Joint Conference on Neural Networks (IJCNN)
DOI: 10.1109/ijcnn.2017.7965918

Audio visual speech recognition with multimodal recurrent neural networks

Cited by 60 publications (28 citation statements)
References 19 publications
“…The best recognition accuracy is 93.33% when using BiLSTM with early-integrated audio-visual features, an improvement over audio-only of up to 8.33%, which shows that our proposed model gives better recognition accuracy than that obtained in [40], which reports 87.7% for audio-visual recognition on the same dataset, as shown in Table 4. DCT is applied to the input image to extract the most important features, zigzag scanning then selects the main coefficients (minimizing the number of features), and these features are fed to the BiLSTM classifier to perform video-only speech recognition.…”
Section: Figure 8: AVLetters Confusion Matrix With Audio-Visual Features (supporting)
confidence: 51%
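
As an illustration of the pipeline this statement describes (2-D DCT feature extraction, zigzag coefficient selection, BiLSTM classification), here is a minimal Python sketch. It is not the cited authors' implementation: the 64x64 mouth-region frame size, the number of retained coefficients k=64, the hidden size, and the 26 output classes (AVLetters covers the letters A-Z) are illustrative assumptions.

    # Hypothetical sketch, not the paper's code: DCT + zigzag selection -> BiLSTM.
    import numpy as np
    import torch
    import torch.nn as nn
    from scipy.fftpack import dct

    def zigzag_indices(n):
        # (row, col) pairs of an n x n block in JPEG-style zigzag order,
        # so low-frequency DCT coefficients come first.
        return sorted(((r, c) for r in range(n) for c in range(n)),
                      key=lambda rc: (rc[0] + rc[1],
                                      rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

    def dct_zigzag_features(frame, k=64):
        # 2-D DCT of a square grayscale frame; keep the first k zigzag coefficients.
        coeffs = dct(dct(frame, axis=0, norm='ortho'), axis=1, norm='ortho')
        idx = zigzag_indices(frame.shape[0])[:k]
        return np.array([coeffs[r, c] for r, c in idx], dtype=np.float32)

    class BiLSTMClassifier(nn.Module):
        def __init__(self, in_dim=64, hidden=128, n_classes=26):
            super().__init__()
            self.lstm = nn.LSTM(in_dim, hidden, batch_first=True,
                                bidirectional=True)
            self.fc = nn.Linear(2 * hidden, n_classes)  # 2x for both directions

        def forward(self, x):              # x: (batch, time, in_dim)
            out, _ = self.lstm(x)
            return self.fc(out[:, -1])     # logits from the final time step

    # Usage on a dummy 30-frame clip of 64x64 mouth crops:
    frames = np.random.rand(30, 64, 64)
    feats = torch.from_numpy(
        np.stack([dct_zigzag_features(f) for f in frames]))[None]
    logits = BiLSTMClassifier()(feats)     # shape: (1, 26)

The zigzag ordering matters because the 2-D DCT concentrates most of the signal energy in the low-frequency corner of the block; taking the first k coefficients in zigzag order keeps that energy while discarding high-frequency detail.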
“…Denote the input layer, hidden layer and output layer at time t as X^{(t)}, h^{(t)} and o^{(t)}, respectively, where U, V, and W are the weight matrices of the input-to-hidden, hidden-to-output, and hidden-to-hidden connections, respectively [52]. In RNNs, the connections between nodes form a directed loop, so sequential events can be interpreted in relation to one another [44,53].…”
Section: Deep Belief Network (DBNs) (mentioning)
confidence: 99%
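
The update equations implied by this notation were lost in extraction; a standard Elman-style recurrence consistent with the quoted definitions is the following (the activation functions f and g and the bias terms b_h and b_o are assumptions, not given in the quote):

    h^{(t)} = f(U X^{(t)} + W h^{(t-1)} + b_h)
    o^{(t)} = g(V h^{(t)} + b_o)

Here f is typically tanh and g is typically a softmax over the output classes.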
“…There is increased interest in using neural networks for multi-modal fusion of auditory and visual signals to solve various speech-related problems. These include audio-visual speech recognition [Feng et al. 2017; Mroueh et al. 2015; Ngiam et al. 2011], predicting speech or text from silent video (lipreading) [Chung et al. 2016; Ephrat et al. 2017], and unsupervised learning of language from visual and speech signals [Harwath et al. 2016]. These methods leverage natural synchrony between simultaneously recorded visual and auditory signals.…”
Section: Related Work (mentioning)
confidence: 99%