2019
DOI: 10.1007/978-3-030-20890-5_3
GhostVLAD for Set-Based Face Recognition

Abstract: The objective of this paper is to learn a compact representation of image sets for template-based face recognition. We make the following contributions: first, we propose a network architecture which aggregates and embeds the face descriptors produced by deep convolutional neural networks into a compact fixed-length representation. This compact representation requires minimal memory storage and enables efficient similarity computation. Second, we propose a novel GhostVLAD layer that includes ghost clusters, th…

Cited by 67 publications (50 citation statements)
References 40 publications (83 reference statements)
“…Therefore, while aggregating the frame-level features, the contribution of the noisy and undesirable sections of a speech segment to normal VLAD clusters is effectively downweighted, as most of their weights have been assigned to the 'ghost cluster'. For further details, please see [17].…”
Section: Methods
confidence: 99%
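The ghost-cluster mechanism this citation describes can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: GhostVLAD computes assignment logits with a learned 1x1 convolution, whereas here negative squared distances stand in for those logits. Frames are soft-assigned to all K + G clusters, but residuals are accumulated only for the K normal ones, so weight absorbed by ghost clusters is simply discarded:

```python
import numpy as np

def ghostvlad(features, centers, n_ghost, alpha=1.0):
    """Aggregate frame-level features into one set-level descriptor (sketch).

    features: (N, D) frame descriptors
    centers:  (K + n_ghost, D) cluster centers; the last n_ghost are ghosts
    Returns a flattened, L2-normalized (K * D,) vector.
    """
    # Soft-assign each frame to ALL clusters, ghosts included.
    # Stand-in logits: negative (scaled) squared distances to the centers.
    d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, K+G)
    logits = -alpha * d2
    a = np.exp(logits - logits.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)                                 # softmax

    K = centers.shape[0] - n_ghost
    # Accumulate residuals only for the K normal clusters; assignment
    # weight that went to ghost clusters is dropped, which downweights
    # noisy frames in the final descriptor.
    V = np.einsum('nk,nkd->kd', a[:, :K],
                  features[:, None, :] - centers[None, :K, :])        # (K, D)
    # Intra-normalize per cluster, then L2-normalize the whole vector.
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12
    v = V.ravel()
    return v / (np.linalg.norm(v) + 1e-12)
```

A frame dominated by noise ends up with most of its softmax mass on a ghost cluster, so its residual contribution to the K retained columns of `V` is near zero.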
“…We make the following contributions: (i) We propose a powerful speaker recognition deep network, based on a NetVLAD [16] or GhostVLAD [17] layer that is used to aggregate 'thin-ResNet' architecture frame features; (ii) The entire network is trained end-to-end using a large margin softmax loss on the large-scale VoxCeleb2 [3] dataset, and achieves a significant improvement over the current state-of-the-art verification performance on VoxCeleb1, despite using fewer parameters than the current state-of-the-art architectures [3,13]; and (iii) We analyse the effect of input segment length on performance, and conclude that for 'in the wild' sequences, longer utterances (4s or more) yield a significant improvement over shorter segments.…”
Section: Introduction
confidence: 99%
“…The first model, following [17], is an LSTM-based neural architecture that maps a sequence of variable-length mel-spectrogram frames to a fixed-dimensional embedding and is trained using the generalized end-to-end (GE2E) loss function. The second model, ThinResNet [16], is composed of a modified ResNet architecture, which extracts features from the spectrogram of a speech utterance, and a NetVLAD/GhostVLAD [18] layer that aggregates the features along the temporal axis into an embedding. Both models are trained on the VoxCeleb2 [19] dataset, which comprises 1 million utterances from 6,000 different speakers, and achieve 4.51% and 3.34% EER, respectively, on the test-clean split of the LibriSpeech [12] dataset.…”
Section: Speaker Information
confidence: 99%
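Once utterances are mapped to fixed-dimensional embeddings as above, verification typically reduces to thresholded cosine scoring, which is also how EER operating points like those quoted are swept. A minimal sketch (the threshold value is illustrative, not from the cited work; in practice it is tuned on a development set):

```python
import numpy as np

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two utterance embeddings."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(a @ b)

def verify(emb_a, emb_b, threshold=0.5):
    # Accept as same-speaker if the score clears the threshold; sweeping
    # the threshold over a labeled trial list traces the EER curve.
    return cosine_score(emb_a, emb_b) >= threshold
```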
“…2.2.1). In a common approach, multiple face captures of the same person are combined to gain robustness against pose, expression, and quality variations [21]. In our scenario, the group is composed of unique faces of different persons.…”
Section: Face Recognition
confidence: 99%