The 14th International Conference on Auditory-Visual Speech Processing 2017
DOI: 10.21437/avsp.2017-13

Exploring ROI size in deep learning based lipreading

Abstract: Automatic speechreading systems have increasingly exploited deep learning advances, resulting in dramatic gains over traditional methods. State-of-the-art systems typically employ convolutional neural networks (CNNs), operating on a video region-of-interest (ROI) that contains the speaker's mouth. However, little or no attention has been paid to the effects of ROI physical coverage and resolution on the resulting recognition performance within the deep learning framework. In this paper, we investigate such cho…

Cited by 21 publications (12 citation statements)
References 29 publications
“…Most current lipreading research extracts a fixed-size lip ROI as input, and choosing that size remains an open problem. Koumparoulis et al. [207] showed experimentally that the choice of lip ROI size affects the final recognition results, but an optimal ROI size selection scheme has not yet been determined.…”
Section: Discussion
confidence: 99%
“…LipNet is the closest model to our neural network. Several similar architectures were subsequently introduced in the works of Thanda & Venkatesan (2017), who study audio-visual feature fusion, Koumparoulis et al. (2017), who work on a small subset of 18 phonemes and 11 words to predict digit sequences, and Xu et al. (2018), who presented a model cascading CTC with attention. were the first to use sequence-to-sequence models with attention to tackle audiovisual speech recognition with a real-world dataset.…”
Section: Related Work
confidence: 99%
“…Further, except for the pointwise-only network, all networks keep the conventional convolution in their first layer, i.e., without factorizing it to depthwise and pointwise ones. All models (including the baseline and ResNet) are trained on the same ROI size, as this affects both performance [45] and efficiency. In more detail, the following models are considered:…”
Section: Network Considered
confidence: 99%
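The last statement contrasts conventional convolutions with their factorization into depthwise and pointwise layers. A small parameter-count comparison shows why that factorization matters for efficiency; the channel and kernel sizes below are illustrative, not from the cited work:

```python
def conv_params(c_in, c_out, k):
    """Weights of a conventional k x k convolution (no bias term)."""
    return c_in * c_out * k * k

def separable_params(c_in, c_out, k):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise one."""
    return c_in * k * k + c_in * c_out

# Example: 64 -> 128 channels with a 3x3 kernel.
conv = conv_params(64, 128, 3)      # 64 * 128 * 9 = 73728
sep = separable_params(64, 128, 3)  # 64 * 9 + 64 * 128 = 8768
print(conv, sep, round(conv / sep, 1))  # 73728 8768 8.4
```

Keeping a conventional convolution in the first layer, as the quoted networks do, costs little because the input channel count there is small, so the factorization's savings are marginal at that stage.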