"Notic My Speech" -- Blending Speech Patterns With Multimedia

Sahrawat, Dhruva; Kumar, Yaman; Aggarwal, Shashwat; Yin, Yifang; Shah, Rajiv Ratn; Zimmermann, Roger

doi:10.48550/arxiv.2006.08599

Cited by 1 publication

(2 citation statements)

References 27 publications

“…iii) We also consider other alternative recognition solutions available in the literature for each experiment. For lip reading, we include PCA+LSTM+HMM [33], CNN+LSTM [16], 4-layer CNN+Hierarchical LSTM [22], and VGG-M+Attentive Bi-LSTM [26], as they use LSTM networks in combination with different spatial feature extractors and classifiers. 3DCNN [16] has also been considered as it has shown to be an effective alternative to CNN + LSTM architectures.…”

Section: Benchmarks (mentioning)

confidence: 99%

“…To this end, we use the same 4-layer CNN used in [22], thus solely comparing our MP-LSTM network with the hierarchical LSTM network proposed in [22]. The results show the superiority of our MP-LSTM when compared to attentive Bi-LSTM [26] and hierarchical LSTM [22]. Experiment 2 (LF-based Face Recognition): Table 4 presents the face recognition performance when, respectively: i) LFFC is used for training and LFFW for testing (Protocol 1); and ii) LFFW is used for training and LFFC for testing (Protocol 2).…”

Section: Solution (mentioning)

confidence: 99%

Multi-Perspective LSTM for Joint Visual Representation Learning

Sepas-Moghaddam, Pereira, Correia et al. 2021

Preprint

We present a novel LSTM cell architecture capable of learning both intra- and inter-perspective relationships available in visual sequences captured from multiple perspectives. Our architecture adopts a novel recurrent joint learning strategy that uses additional gates and memories at the cell level. We demonstrate that by using the proposed cell to create a network, more effective and richer visual representations are learned for recognition tasks. We validate the performance of our proposed architecture in the context of two multi-perspective visual recognition tasks, namely lip reading and face recognition. Three relevant datasets are considered and the results are compared against fusion strategies, other existing multi-input LSTM architectures, and alternative recognition solutions. The experiments show the superior performance of our solution over the considered benchmarks, both in terms of recognition accuracy and complexity. We make our code publicly available at https://github.com/arsm/MPLSTM.
