2017
DOI: 10.1007/978-3-319-54427-4_21

Concatenated Frame Image Based CNN for Visual Speech Recognition

Abstract: This paper proposes a novel sequence image representation method called the concatenated frame image (CFI), two types of data augmentation for CFI, and a CFI-based convolutional neural network (CNN) framework for the visual speech recognition (VSR) task. A CFI is simple, yet it contains the spatio-temporal information of a whole image sequence. The proposed method was evaluated on the public OuluVS2 database, a multi-view audiovisual dataset recorded from 52 subjects. The speaker-independent recog…

Cited by 35 publications (31 citation statements)
References 14 publications
“…Results are shown in Table 3. This dataset is balanced, so we report only the classification rate, which is the default performance measure for this database [25]. The best performance in video-only experiments is achieved by the frontal and profile views, followed by the 45°, 30° and 60° views.…”
Section: Results on OuluVS2 Database
confidence: 99%
See 1 more Smart Citation
“…Results are shown in Table 3. This dataset is balanced so we just report the classification rate which is the default performance measure for this database [25]. The best performance in video-only experiments is achieved by the frontal and profile views followed by the 45°, 30°and 60°views.…”
Section: Results On Ouluvs2 Databasementioning
confidence: 99%
“…The protocol suggested in [25] is used for the OuluVS2 dataset, where 40 subjects are used for training and validation and 12 for testing. We randomly divided the 40 subjects into 35 and 5 subjects for training and validation purposes, respectively.…”
Section: Evaluation Protocol
confidence: 99%
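The subject-level split described above (12 held-out test subjects, the remaining 40 divided at random into 35 for training and 5 for validation) can be sketched as follows. This is an illustrative sketch only: the actual test-subject IDs are fixed by the OuluVS2 protocol, and here they are simply assumed to be supplied as the last 12 IDs.

```python
import random

def split_subjects(subjects, n_test=12, n_val=5, seed=0):
    """Split subject IDs into train/val/test following the protocol
    described above. Assumption: the protocol's fixed test-subject
    IDs are passed in as the last n_test entries of `subjects`."""
    pool = list(subjects)
    test = pool[-n_test:]          # held-out test subjects
    rest = pool[:-n_test]          # 40 remaining subjects
    random.Random(seed).shuffle(rest)
    val = rest[:n_val]             # 5 validation subjects
    train = rest[n_val:]           # 35 training subjects
    return train, val, test

# OuluVS2 has 52 subjects in total
train, val, test = split_subjects(range(1, 53))
print(len(train), len(val), len(test))  # 35 5 12
```

Splitting by subject, rather than by video clip, is what makes the evaluation speaker-independent: no subject appears in more than one partition.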
“…The same 5 views have been evaluated on the OuluVS2 database and the results are still conflicting. The frontal view was found to be the best by Saitoh et al. [29], the profile view by Lee et al. [19], and the 30° view by Zimmermann et al. [34], while the 60° view was found to be the best performing in another study. Three different convolutional neural networks (CNNs), GoogLeNet, AlexNet and Network in Network, were trained on OuluVS2 using data augmentation.…”
Section: Related Work
confidence: 92%
“…We first partition the data into training, validation and test sets. The protocol suggested by the creators of the OuluVS2 database is used [29] where 40 subjects are used for training and validation and 12 for testing. We randomly divided the 40 subjects into 35 and 5 subjects for training and validation purposes, respectively.…”
Section: Methods
confidence: 99%
“…Thus far, methods applying a time series image or video data as the input of a CNN have been proposed [29,30]. Saitoh et al proposed a sequence image representation, namely a concatenated frame image (CFI) and a CFI-based CNN model for visual speech recognition [30]. A CFI is formed by concatenating frames sampled at uniform intervals from a video sequence.…”
Section: Input Image
confidence: 99%
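The CFI construction quoted above — frames sampled at uniform intervals from a video and concatenated into a single image — can be sketched as below. The grid layout and frame count are assumptions for illustration; the cited paper [30] defines its own CFI layout.

```python
import numpy as np

def make_cfi(frames, n=4, grid=(2, 2)):
    """Build a concatenated frame image (CFI): sample n frames at
    uniform intervals from a video and tile them into a grid.
    Sketch of the idea; grid size and n are illustrative choices."""
    frames = np.asarray(frames)                   # (T, H, W, C)
    t = frames.shape[0]
    idx = np.linspace(0, t - 1, n).astype(int)    # uniform sampling
    sampled = frames[idx]
    rows, cols = grid
    assert rows * cols == n
    # Concatenate each row of frames horizontally, then stack rows.
    row_imgs = [np.concatenate(list(sampled[r * cols:(r + 1) * cols]), axis=1)
                for r in range(rows)]
    return np.concatenate(row_imgs, axis=0)       # (rows*H, cols*W, C)

# Tiny demo: a synthetic 10-frame "video" of 8x8 single-channel frames
video = np.random.rand(10, 8, 8, 1)
cfi = make_cfi(video, n=4, grid=(2, 2))
print(cfi.shape)  # (16, 16, 1)
```

Because the CFI is a single image carrying spatio-temporal information, it can be fed directly to a standard image CNN (e.g. GoogLeNet or AlexNet, as in the text above) without any recurrent components.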