A deep neural network for audio-visual person recognition

Alam, Mohammad Rafiqul; Bennamoun, Mohammed; Togneri, Roberto; Sohel, Ferdous

doi:10.1109/btas.2015.7358754

Cited by 7 publications

(10 citation statements)

References 13 publications


“…Some studies have attempted to enhance the quality of person recognition from two data sources (audio-visual data) using DBN and DBM [72] models, which have allowed several types of representation to be combined and coordinated. Some of these works include [48], [73]. According to Salakhutdinov et al [72], a DBM is a generative model that includes several layers of hidden variables.…”

Section: Human Recognition (mentioning)

confidence: 99%

“…According to Salakhutdinov et al [72], a DBM is a generative model that includes several layers of hidden variables. In [48], the structure of deep multimodal Boltzmann machines (DMBM) [71] is similar to that of DBM, but it can admit more than one modality. Therefore, each modality will be covered individually using adaptive approaches.…”

Section: Human Recognition (mentioning)

confidence: 99%


A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

et al. 2021


The research progress in multimodal learning has grown rapidly over the last decade in several areas, especially in computer vision. The growing potential of multimodal data streams and deep learning algorithms has contributed to the increasing universality of deep multimodal learning. This involves the development of models capable of processing and analyzing the multimodal information uniformly. Unstructured real-world data can inherently take many forms, also known as modalities, often including visual and textual content. Extracting relevant patterns from this kind of data is still a motivating goal for researchers in deep learning. In this paper, we seek to improve the understanding of key concepts and algorithms of deep multimodal learning for the computer vision community by exploring how to generate deep models that consider the integration and combination of heterogeneous visual cues across sensory modalities. In particular, we summarize six perspectives from the current literature on deep multimodal learning, namely: multimodal data representation, multimodal fusion (i.e., both traditional and deep learning-based schemes), multitask learning, multimodal alignment, multimodal transfer learning, and zero-shot learning. We also survey current multimodal applications and present a collection of benchmark datasets for solving problems in various vision domains. Finally, we highlight the limitations and challenges of deep multimodal learning and provide insights and directions for future research.



“…It is observed that MFCCs have performed better than others. Alam et al [4], [5] have explored the usage of MFCCs in deep neural network based methods. Further, MFCCs are also used in creating i-vectors, which performed better with Linear Discriminant Analysis (LDA) and Within Class Covariance Normalisation (WCCN) [105].…”

Section: Cepstral Coefficients (mentioning)

confidence: 99%

“…LBP features are extracted on the detected faces for multimodal authentication in [124], [125]. Deep neural network based AV recognition systems [4] employed LBPs as visual features from face images that are photometrically normalized using the Tan-Triggs algorithm [128]. In further research, a joint deep Boltzmann machine (jDBM) model that uses LBPs is introduced with an improved performance [5].…”

Section: Texture Based Features (mentioning)

confidence: 99%

Audio-Visual Biometric Recognition and Presentation Attack Detection: A Comprehensive Survey

Mandalapu, Ramachandra, et al. 2021

IEEE Access


Biometric recognition is a trending technology that uses unique characteristics data to identify or verify/authenticate security applications. Amidst the classically used biometrics, voice and face attributes are the most propitious for prevalent applications in day-to-day life because they are easy to obtain through restrained and user-friendly procedures. The pervasiveness of low-cost audio and face capture sensors in smartphones, laptops, and tablets has made the advantage of voice and face biometrics more exceptional when compared to other biometrics. For many years, acoustic information alone has been a great success in automatic speaker verification applications. Meantime, the last decade or two has also witnessed a remarkable ascent in face recognition technologies. Nonetheless, in adverse unconstrained environments, neither of these techniques achieves optimal performance. Since audio-visual information carries correlated and complementary information, integrating them into one recognition system can increase the system's performance. The vulnerability of biometrics towards presentation attacks and audio-visual data usage for the detection of such attacks is also a hot topic of research. This paper made a comprehensive survey on existing state-of-the-art audio-visual recognition techniques, publicly available databases for benchmarking, and Presentation Attack Detection (PAD) algorithms. Further, a detailed discussion on challenges and open problems is presented in this field of biometrics.

INDEX TERMS: Biometrics, audio-visual person recognition, presentation attack detection.


“…The majority of them are intended to detect and recognize targets and few are for cognitive development. For instance, some fusion networks learn visual images and sounds respectively using two branches of deep neural network and integrate them by connecting their vectors in series [19]- [21]. But these computational models have fixed topology and need to be trained with enormous data in an offline way.…”

Section: (B) the Process Mainly Involves Audiovisual Integration And (mentioning)

confidence: 99%

An Autonomous Developmental Cognitive Architecture Based on Incremental Associative Neural Network With Dynamic Audiovisual Fusion

Huang, Song, et al. 2019

IEEE Access


Developing cognition is difficult to achieve yet crucial for robots. Infants can gradually improve their cognition through parental guidance and self-exploration. However, conventional learning methods for robots often focus on a single modality and train a pre-defined model by large datasets in an offline way. In this paper, we propose a hierarchical autonomous cognitive architecture for robots to learn object concepts online by interacting with humans. Two pathways for audiovisual information are devised. Each pathway has three layers based on the self-organizing incremental neural networks. Visual features and names of objects are incrementally learned and self-organized in an unsupervised way in sample layers, respectively, in which we propose a dynamically adjustable similarity threshold strategy to allow the network itself to control cluster rather than using a pre-defined threshold. Two symbol layers abstract the cluster results from the corresponding sample layer to form concise symbols and transmit them to an associative layer. An associative relationship between two modalities can be built in real time by binding activated visual and auditory symbols simultaneously in the associative layer. In this layer, a top-down response strategy is proposed to let robots autonomously recall another associative modality, solve conflicting associative relationships, and adjust learned knowledge from the top down. The experimental results on two objects datasets and a real task show that our architecture is efficient to learn and associate object view and name in an online way. What is more, the robot can autonomously improve its cognitive level by utilizing its own experience without enquiring with humans.

INDEX TERMS: Cognitive development, concept online learning, self-organizing incremental neural network, object recognition, audiovisual integration.
