The 14th International Conference on Auditory-Visual Speech Processing 2017
DOI: 10.21437/avsp.2017-13

Exploring ROI size in deep learning based lipreading

Abstract: Automatic speechreading systems have increasingly exploited deep learning advances, resulting in dramatic gains over traditional methods. State-of-the-art systems typically employ convolutional neural networks (CNNs), operating on a video region-of-interest (ROI) that contains the speaker's mouth. However, little or no attention has been paid to the effects of ROI physical coverage and resolution on the resulting recognition performance within the deep learning framework. In this paper, we investigate such cho…

Cited by 21 publications (12 citation statements)
References 29 publications
“…Most current lipreading research extracts a fixed-size lip ROI as input, and choosing that size remains an open problem. Koumparoulis et al. [207] showed experimentally that the choice of lip ROI size affects the final recognition results, but an optimal ROI size selection scheme has not yet been determined.…”
Section: Discussion
confidence: 99%
“…LipNet is the closest model to our neural network. Several similar architectures were subsequently introduced in the works of Thanda & Venkatesan (2017), who study audio-visual feature fusion, Koumparoulis et al. (2017), who work on a small subset of 18 phonemes and 11 words to predict digit sequences, and Xu et al. (2018), who presented a model cascading CTC with attention. were the first to use sequence-to-sequence models with attention to tackle audiovisual speech recognition with a real-world dataset.…”
Section: Related Work
confidence: 99%
“…Further, except for the pointwise-only network, all networks keep the conventional convolution in their first layer, i.e., without factorizing it to depthwise and pointwise ones. All models (including the baseline and ResNet) are trained on the same ROI size, as this affects both performance [45] and efficiency. In more detail, the following models are considered:…”
Section: Network Considered
confidence: 99%
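The last statement contrasts conventional convolutions with their factorization into depthwise and pointwise layers. A small parameter-count comparison shows why that factorization matters for efficiency; the channel and kernel sizes below are illustrative, not from the cited work:

```python
def conv_params(c_in, c_out, k):
    """Weights of a conventional k x k convolution (no bias term)."""
    return c_in * c_out * k * k

def separable_params(c_in, c_out, k):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise one."""
    return c_in * k * k + c_in * c_out

# Example: 64 -> 128 channels with a 3x3 kernel.
conv = conv_params(64, 128, 3)      # 64 * 128 * 9 = 73728
sep = separable_params(64, 128, 3)  # 64 * 9 + 64 * 128 = 8768
print(conv, sep, round(conv / sep, 1))  # 73728 8768 8.4
```

Keeping a conventional convolution in the first layer, as the quoted networks do, costs little because the input channel count there is small, so the factorization's savings are marginal at that stage.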