2017
DOI: 10.1007/978-3-319-54427-4_21

Concatenated Frame Image Based CNN for Visual Speech Recognition

Abstract: This paper proposes a novel sequence image representation method called the concatenated frame image (CFI), two types of data augmentation for CFI, and a CFI-based convolutional neural network (CNN) framework for the visual speech recognition (VSR) task. A CFI is simple, yet it contains the spatio-temporal information of a whole image sequence. The proposed method was evaluated on the public OuluVS2 database, a multi-view audiovisual dataset recorded from 52 subjects. The speaker-independent recog…

Cited by 35 publications (31 citation statements)
References 14 publications
“…Results are shown in Table 3. This dataset is balanced, so we report only the classification rate, which is the default performance measure for this database [25]. The best performance in video-only experiments is achieved by the frontal and profile views, followed by the 45°, 30° and 60° views.…”
Section: Results on OuluVS2 Database
confidence: 99%
See 1 more Smart Citation
“…Results are shown in Table 3. This dataset is balanced so we just report the classification rate which is the default performance measure for this database [25]. The best performance in video-only experiments is achieved by the frontal and profile views followed by the 45°, 30°and 60°views.…”
Section: Results On Ouluvs2 Databasementioning
confidence: 99%
“…The protocol suggested in [25] is used for the OuluVS2 dataset, where 40 subjects are used for training and validation and 12 for testing. We randomly divided the 40 subjects into 35 and 5 subjects for training and validation purposes, respectively.…”
Section: Evaluation Protocol
confidence: 99%
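The subject-level split described above (12 held-out test subjects, the remaining 40 divided at random into 35 for training and 5 for validation) can be sketched as follows. This is an illustrative sketch only: the actual test-subject IDs are fixed by the OuluVS2 protocol, and here they are simply assumed to be supplied as the last 12 IDs.

```python
import random

def split_subjects(subjects, n_test=12, n_val=5, seed=0):
    """Split subject IDs into train/val/test following the protocol
    described above. Assumption: the protocol's fixed test-subject
    IDs are passed in as the last n_test entries of `subjects`."""
    pool = list(subjects)
    test = pool[-n_test:]          # held-out test subjects
    rest = pool[:-n_test]          # 40 remaining subjects
    random.Random(seed).shuffle(rest)
    val = rest[:n_val]             # 5 validation subjects
    train = rest[n_val:]           # 35 training subjects
    return train, val, test

# OuluVS2 has 52 subjects in total
train, val, test = split_subjects(range(1, 53))
print(len(train), len(val), len(test))  # 35 5 12
```

Splitting by subject, rather than by video clip, is what makes the evaluation speaker-independent: no subject appears in more than one partition.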
“…The same 5 views have been evaluated on the OuluVS2 database and the results are still conflicting. The frontal view was found to be the best by Saitoh et al. [29], the profile view by Lee et al. [19], and the 30° view by Zimmermann et al. [34], while the 60° view was found to be the best performing in another study. Three different convolutional neural networks (CNNs), GoogLeNet, AlexNet and Network in Network, were trained on OuluVS2 using data augmentation.…”
Section: Related Work
confidence: 92%
“…We first partition the data into training, validation and test sets. The protocol suggested by the creators of the OuluVS2 database is used [29] where 40 subjects are used for training and validation and 12 for testing. We randomly divided the 40 subjects into 35 and 5 subjects for training and validation purposes, respectively.…”
Section: Methods
confidence: 99%
“…Thus far, methods applying a time series image or video data as the input of a CNN have been proposed [29,30]. Saitoh et al proposed a sequence image representation, namely a concatenated frame image (CFI) and a CFI-based CNN model for visual speech recognition [30]. A CFI is formed by concatenating frames sampled at uniform intervals from a video sequence.…”
Section: Input Image
confidence: 99%
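The CFI construction quoted above — frames sampled at uniform intervals from a video and concatenated into a single image — can be sketched as below. The grid layout and frame count are assumptions for illustration; the cited paper [30] defines its own CFI layout.

```python
import numpy as np

def make_cfi(frames, n=4, grid=(2, 2)):
    """Build a concatenated frame image (CFI): sample n frames at
    uniform intervals from a video and tile them into a grid.
    Sketch of the idea; grid size and n are illustrative choices."""
    frames = np.asarray(frames)                   # (T, H, W, C)
    t = frames.shape[0]
    idx = np.linspace(0, t - 1, n).astype(int)    # uniform sampling
    sampled = frames[idx]
    rows, cols = grid
    assert rows * cols == n
    # Concatenate each row of frames horizontally, then stack rows.
    row_imgs = [np.concatenate(list(sampled[r * cols:(r + 1) * cols]), axis=1)
                for r in range(rows)]
    return np.concatenate(row_imgs, axis=0)       # (rows*H, cols*W, C)

# Tiny demo: a synthetic 10-frame "video" of 8x8 single-channel frames
video = np.random.rand(10, 8, 8, 1)
cfi = make_cfi(video, n=4, grid=(2, 2))
print(cfi.shape)  # (16, 16, 1)
```

Because the CFI is a single image carrying spatio-temporal information, it can be fed directly to a standard image CNN (e.g. GoogLeNet or AlexNet, as in the text above) without any recurrent components.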