Interspeech 2019
DOI: 10.21437/interspeech.2019-2618
MobiLipNet: Resource-Efficient Deep Learning Based Lipreading

Abstract: Recent works in visual speech recognition utilize deep learning advances to improve accuracy. Focus, however, has been primarily on recognition performance, ignoring the computational burden of deep architectures. In this paper we address these issues concurrently, aiming at both high computational efficiency and recognition accuracy in lipreading. For this purpose, we investigate the MobileNet convolutional neural network architectures, recently proposed for image classification. In addition, we extend th…

Cited by 6 publications (4 citation statements)
References 27 publications
“…Recently, Feng et al. [127] improved this architecture by integrating the Squeeze-and-Excitation [128] module. Besides VGG and ResNet, researchers have also adopted other representative 2D CNN architectures, including DenseNet [58], ShuffleNet [52], MobileNet [129], etc.…”
Section: Visual Frontend Network
confidence: 99%
“…This model consists of spatiotemporal convolutions and recurrent operations, and is trained with a connectionist temporal classification loss [18]. MobiLipNet [19] has been proposed to achieve computationally efficient lipreading, using depthwise and pointwise convolutions. Some prior works on lip reading are based on generative adversarial networks (GANs) [20].…”
Section: Related Work
confidence: 99%
“…and battery consumption are also important factors. Consequently, a few works have also focused on the computational complexity of visual speech recognition [11,12], but such models still trail massively behind full-fledged ones in terms of accuracy.…”
Section: Introduction
confidence: 99%
“…Such models consist of fully connected [4,5,6,7,8] or convolutional layers [9,10,11,12], which extract features from the mouth region of interest, followed by recurrent layers, attention [12,13], or self-attention architectures [11]. A few works have also focused on the computational complexity of visual speech recognition [14,15], but efficient methods have trailed massively behind full-fledged ones in terms of accuracy.…”
Section: Introduction
confidence: 99%