Speaker Adaptive Audio-Visual Fusion for the Open-Vocabulary Section of AVICAR

Sarı, Leda; Hasegawa‐Johnson, Mark; Kumaran, Senthil; Stemmer, Georg; Nair, Krishnakumar N.

doi:10.21437/interspeech.2018-2359

Cited by 1 publication

(1 citation statement)

References 14 publications


“…The field of visual speech recognition (VSR), or lipreading, has witnessed dramatic breakthroughs recently, primarily due to the paradigm shift from hand-crafted features to deep learning based models [1][2][3][4][5][6][7][8], coupled with the public release of large suitable corpora in a variety of environments [9][10][11][12][13][14][15], as also reviewed in [16,17]. Such models however, while reducing recognition errors compared to previous approaches, are not as efficient to compute and store.…”

Section: Introduction (mentioning)

confidence: 99%

MobiLipNet: Resource-Efficient Deep Learning Based Lipreading

Koumparoulis; Potamianos. Interspeech 2019.


Recent works in visual speech recognition utilize deep learning advances to improve accuracy. Focus however has been primarily on recognition performance, while ignoring the computational burden of deep architectures. In this paper we address these issues concurrently, aiming at both high computational efficiency and recognition accuracy in lipreading. For this purpose, we investigate the MobileNet convolutional neural network architectures, recently proposed for image classification. In addition, we extend the 2D convolutions of MobileNets to 3D ones, in order to better model the spatio-temporal nature of the lipreading problem. We investigate two architectures in this extension, introducing the temporal dimension as part of either the depthwise or the pointwise MobileNet convolutions. To further boost computational efficiency, we also consider using pointwise convolutions alone, as well as networks operating on half the mouth region. We evaluate the proposed architectures on speaker-independent visual-only continuous speech recognition on the popular TCD-TIMIT corpus. Our best system outperforms a baseline CNN by 4.27% absolute in word error rate and over 12 times in computational efficiency, whereas, compared to a state-of-the-art ResNet, it is 37 times more efficient at a minor 0.07% absolute error rate degradation.
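The abstract describes two ways of adding a temporal dimension to a MobileNet-style depthwise-separable convolution: in the depthwise step or in the pointwise step. The sketch below is a rough illustration of why such factorizations are cheaper than a full 3D convolution. It only counts weights (no biases) for one layer. The function names, the kernel size, and the channel counts are assumptions chosen for illustration and are not taken from the paper, whose exact layer definitions may differ.

```python
# Illustrative weight counts for one convolutional layer (biases ignored).
# All names and sizes here are hypothetical, not taken from the MobiLipNet paper.

def standard_conv3d_params(kt, kh, kw, c_in, c_out):
    """Full 3D convolution: every output channel sees every input channel."""
    return kt * kh * kw * c_in * c_out


def temporal_depthwise_params(kt, kh, kw, c_in, c_out):
    """Assumed variant 1: a 3D depthwise filter per input channel,
    followed by a 1x1x1 pointwise convolution that mixes channels."""
    depthwise = kt * kh * kw * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


def temporal_pointwise_params(kt, kh, kw, c_in, c_out):
    """Assumed variant 2: a 2D depthwise filter per input channel,
    followed by a pointwise convolution extended over kt frames."""
    depthwise = kh * kw * c_in
    pointwise = kt * c_in * c_out
    return depthwise + pointwise


if __name__ == "__main__":
    # Assumed layer shape: 3x3x3 kernel, 64 input and 128 output channels.
    shape = dict(kt=3, kh=3, kw=3, c_in=64, c_out=128)
    for fn in (standard_conv3d_params,
               temporal_depthwise_params,
               temporal_pointwise_params):
        print(fn.__name__, fn(**shape))
```

Under these assumed sizes, both factorized variants need far fewer weights than the full 3D convolution, and placing the temporal extent in the depthwise step is the cheaper of the two. Weight counts are only a proxy; the efficiency figures quoted in the abstract come from the paper's own measurements.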
