Two-Level Attention-based Fusion Learning for RGB-D Face Recognition

Uppal, Hardik; Sepas-Moghaddam, Alireza; Greenspan, Michael; Etemad, Ali

doi:10.1109/icpr48806.2021.9412514

Cited by 11 publications

(10 citation statements)

References 22 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…The obtained embeddings were finally fused to feed an SVM classifier for performing FR. Jiang et al [38] presented an 2013 Hand-crafted HOG RDF Feature-level IIIT-D [25] 2013 Hand-crafted ICP, DCS SRC N/A IIIT-D [34] 2014 Hand-crafted PCA, LBP, SIFT, LGBP kNN Score-level Kinect Face [27] 2014 Hand-crafted RISE+HOG RDF Feature-level IIIT-D [24] 2016 Hand-crafted ICP SDF N/A Lock3DFace [35] 2016 Hand-crafted Covariance matrix rep. SVM Score-level CurtinFaces [28] 2016 Deep learning Autoencoder Softmax Score-level Kinect Face [36] 2018 Deep learning Siamese CNN Softmax Feature-level Pandora [30] 2018 Deep learning 9 Layers CNN + Inception Softmax Feature-level VAP, IIIT-D, Lock3DFace [37] 2018 Deep learning Fine-tuned VGG-Face Softmax Feature-level LFFD [38] 2018 Deep learning Custom CNN Attribute-aware loss Feature-level Private dataset [39] 2018 Deep learning Inception-v2 Softmax Feature-level IIIT-D, Lock3DFace [40] 2019 Deep learning 14 layers CNN + Attention Softmax Feature-level Lock3DFace [32] 2020 Deep learning CNN + two-level attention Softmax Feature-level IIIT-D, CurtinFaces [41] 2020 Deep learning Custom CNN Assoc., Discrim., and Softmax Feature-level IIIT-D attribute-aware loss function for CNN-based FR which aims to regularize the distribution of the learned feature vectors with respect to some soft-biometric attributes such as gender, ethnicity, and age, thus boosting FR results. Cui et al [39] estimated the depth from RGB modality using a multi-task approach including face identification along with depth estimation.…”

Section: A Rgb-d Face Recognition Methodsmentioning

confidence: 99%

“…Mu et al [40] proposed adding an attention weight map to each feature map, computed from RGB and depth modalities, thus focusing on the most important pixels with respect to their locations during training. Uppal et al [32] used both spatial and channel information from depth and RGB images and fused the information using a two-step attention mechanism. The attention modules assign weights to features, choosing between features from depth and RGB and hence utilize the information from both data modalities effectively.…”

Section: B Attention Mechanismsmentioning

confidence: 99%

“…As discussed in Section II-C, some attention mechanisms have used pure RGB features to focus attention [11], [71], [12], while others [32], [40] have explored the possibility of attention-aware fusion in facial recognition, so as to fuse depth and RGB modalities together. In contrast, our proposed solution (Figure 1) multiplies the attention weights derived from the depth-guided attention mechanism by the RGB feature maps extracted from the CNN to obtain the final set of salient features.…”

Section: Proposed Depth-guided Attentionmentioning

confidence: 99%

“…Implementation Details 1) Preprocessing: Before feeding the images to the network, both RGB and depth images are cropped using the dlib CNN [74] face-extractor network. For unprocessed depth images, we determine two depth values that respectively represent the near and far clipping planes of the scene and filter out the scene content that is either too near or too far from the camera, keeping only depth values which represent the face depth data as suggested in [32]. Following this process, we then normalize the remaining content to fall within the values 2) Network parameters: The optimal parameter values to achieve the best recognition performance have been empirically obtained and are summarized in Table IV.…”

Section: B Test Protocolsmentioning

confidence: 99%

“…The performance of our solution has been tested on four prominent RGB-D face datasets, including Lock3DFace [24], CurtinFaces [25], IIIT-D RGB-D [26], [27], and Kas-pAROV [28], [29], and has been compared to a number of state-of-the-art RGB-D FR methods [30], [31], [32]. The results reveal that our proposed solution consistently learns better person-specific face representations as evidenced by the improved performance in different FR tasks under different challenging conditions.…”

Section: Introductionmentioning

confidence: 99%

See 4 more Smart Citations

Depth as Attention for Face Representation Learning

Uppal

Sepas-Moghaddam

Greenspan

et al. 2021

IEEE Trans.Inform.Forensic Secur.

Self Cite

View full text Add to dashboard Cite

Face representation learning solutions have recently achieved great success for various applications such as verification and identification. However, face recognition approaches that are based purely on RGB images rely solely on intensity information, and therefore are more sensitive to facial variations, notably pose, occlusions, and environmental changes such as illumination and background. A novel depth-guided attention mechanism is proposed for deep multi-modal face recognition using low-cost RGB-D sensors. Our novel attention mechanism directs the deep network "where to look" for visual features in the RGB image by focusing the attention of the network using depth features extracted by a Convolution Neural Network (CNN). The depth features help the network focus on regions of the face in the RGB image that contain more prominent person-specific information. Our attention mechanism then uses this correlation to generate an attention map for RGB images from the depth features extracted by the CNN. We test our network on four public datasets, showing that the features obtained by our proposed solution yield better results on the Lock3DFace, CurtinFaces, IIIT-D RGB-D, and KaspAROV datasets which include challenging variations in pose, occlusion, illumination, expression, and time lapse. Our solution achieves average (increased) accuracies of 87.3% (+5.0%), 99.1% (+0.9%), 99.7% (+0.6%) and 95.3%(+0.5%) for the four datasets respectively, thereby improving the state-of-the-art. We also perform additional experiments with thermal images, instead of depth images, showing the high generalization ability of our solution when adopting other modalities for guiding the attention mechanism instead of depth information.

show abstract