Domain mismatch refers to the problem in which the distribution of training data differs from that of the test data. This paper proposes a variational domain adversarial neural network (VDANN), which consists of a variational autoencoder (VAE) and a domain adversarial neural network (DANN), to reduce domain mismatch. The DANN part aims to retain speaker identity information and learn a feature space that is robust against domain mismatch, while the VAE part is to impose variational regularization on the learned features so that they follow a Gaussian distribution. Thus, the representation produced by VDANN is not only speaker discriminative and domaininvariant but also Gaussian distributed, which is essential for the standard PLDA backend. Experiments on both SRE16 and SRE18-CMN2 show that VDANN outperforms the Kaldi baseline and the standard DANN. The results also suggest that VAE regularization is effective for domain adaptation.
Domain mismatch is a common problem in speaker verification (SV) and often causes performance degradation. For the system relying on the Gaussian PLDA backend to suppress the channel variability, the performance would be further limited if there is no Gaussianity constraint on the learned embeddings. This paper proposes an information-maximized variational domain adversarial neural network (InfoVDANN) that incorporates an InfoVAE into domain adversarial training (DAT) to reduce domain mismatch and simultaneously meet the Gaussianity requirement of the PLDA backend. Specifically, DAT is applied to produce speaker discriminative and domain-invariant features, while the InfoVAE performs variational regularization on the embedded features so that they follow a Gaussian distribution. Another benefit of the InfoVAE is that it avoids posterior collapse in VAEs by preserving the mutual information between the embedded features and the training set so that extra speaker information can be retained in the features. Experiments on both SRE16 and SRE18-CMN2 show that the InfoVDANN outperforms the recent VDANN, which suggests that increasing the mutual information between the embedded features and input features enables the InfoVDANN to extract extra speaker information that is otherwise not possible.
State-of-the-art speaker verification systems take frame-level acoustics features as input and produce fixed-dimensional embeddings as utterance-level representations. Thus, how to aggregate information from frame-level features is vital for achieving high performance. This paper introduces short-time spectral pooling (STSP) for better aggregation of frame-level information. STSP transforms the temporal feature maps of a speaker embedding network into the spectral domain and extracts the lowest spectral components of the averaged spectrograms for aggregation. Benefiting from the low-pass characteristic of the averaged spectrograms, STSP is able to preserve most of the speaker information in the feature maps using a few spectral components only. We show that statistics pooling is a special case of STSP where only the DC spectral components are used. Experiments on VoxCeleb1 and VOiCES 2019 show that STSP outperforms statistics pooling and multi-head attentive pooling, which suggests that leveraging more spectral information in the CNN feature maps can produce highly discriminative speaker embeddings.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.