2020
DOI: 10.48550/arxiv.2006.08599
Preprint

"Notic My Speech" -- Blending Speech Patterns With Multimedia

Abstract: Speech as a natural signal is composed of three parts: visemes (the visual part of speech), phonemes (the spoken part of speech), and language (the imposed structure). However, video, as a medium for the delivery of speech and as a multimedia construct, has mostly ignored the cognitive aspects of speech delivery. For example, video applications such as transcoding and compression have so far ignored how speech is delivered and heard. To close the gap between speech understanding and multimedia video applications, i…

Citations: cited by 1 publication (2 citation statements)
References: 27 publications
“…iii) We also consider other alternative recognition solutions available in the literature for each experiment. For lip reading, we include PCA+LSTM+HMM [33], CNN+LSTM [16], 4-layer CNN+Hierarchical LSTM [22], and VGG-M+Attentive Bi-LSTM [26], as they use LSTM networks in combination with different spatial feature extractors and classifiers. 3DCNN [16] has also been considered as it has shown to be an effective alternative to CNN + LSTM architectures.…”
Section: Benchmarks
Citation type: mentioning
confidence: 99%
“…To this end, we use the same 4-layer CNN used in [22], thus solely comparing our MP-LSTM network with the hierarchical LSTM network proposed in [22]. The results show the superiority of our MP-LSTM when compared to attentive Bi-LSTM [26] and hierarchical LSTM [22]. Experiment 2 (LF-based Face Recognition): Table 4 presents the face recognition performance when, respectively: i) LFFC is used for training and LFFW for testing (Protocol 1); and ii) LFFW is used for training and LFFC for testing (Protocol 2).…”
Section: Solution
Citation type: mentioning
confidence: 99%