2017
DOI: 10.1007/978-3-319-67220-5_10
Speaker Diarization Using Deep Recurrent Convolutional Neural Networks for Speaker Embeddings

Abstract: In this paper we propose a new method of speaker diarization that employs a deep learning architecture to learn speaker embeddings. In contrast to the traditional approaches that build their speaker embeddings using manually hand-crafted spectral features, we propose to train for this purpose a recurrent convolutional neural network applied directly on magnitude spectrograms. To compare our approach with the state of the art, we collect and release for the public an additional dataset of over 6 hours of fully …
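The pipeline the abstract describes — magnitude spectrograms fed to a convolutional front-end with a recurrent layer on top, pooled into a fixed-dimensional speaker embedding — can be sketched as follows. This is an illustrative toy in plain numpy, not the paper's architecture: the filter sizes, dimensions, and the simple Elman-style recurrence are assumptions chosen for clarity, and the weights are random rather than trained.

```python
import numpy as np

def magnitude_spectrogram(signal, n_fft=256, hop=128):
    """Frame the signal, window it, and take |STFT| -- the network input."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)  # shape: (time_steps, freq_bins)

class ToyRecurrentConvEmbedder:
    """Illustrative stand-in for a recurrent convolutional embedder:
    a 1-D convolution over the frequency axis, an Elman-style recurrent
    pass over time, and mean pooling into a unit-norm embedding."""

    def __init__(self, freq_bins, conv_channels=8, kernel=5, embed_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        self.conv_w = rng.standard_normal((conv_channels, kernel)) * 0.1
        feat_dim = conv_channels * (freq_bins - kernel + 1)
        self.w_in = rng.standard_normal((embed_dim, feat_dim)) * 0.01
        self.w_rec = rng.standard_normal((embed_dim, embed_dim)) * 0.01

    def __call__(self, spec):
        h = np.zeros(self.w_rec.shape[0])
        states = []
        for frame in spec:  # one spectrogram column per time step
            # convolve each learned filter along the frequency axis
            conv = np.array([np.convolve(frame, w, mode="valid")
                             for w in self.conv_w])
            feat = np.tanh(conv).ravel()
            h = np.tanh(self.w_in @ feat + self.w_rec @ h)  # recurrent update
            states.append(h)
        emb = np.mean(states, axis=0)  # pool hidden states over time
        return emb / (np.linalg.norm(emb) + 1e-9)  # unit-norm embedding

# usage: embed one second of a synthetic signal at 8 kHz
t = np.arange(8000) / 8000.0
signal = np.sin(2 * np.pi * 220 * t) \
    + 0.1 * np.random.default_rng(1).standard_normal(8000)
spec = magnitude_spectrogram(signal)
embedder = ToyRecurrentConvEmbedder(freq_bins=spec.shape[1])
embedding = embedder(spec)
print(embedding.shape)  # (16,)
```

In the paper's setting such embeddings are trained on a speaker-classification objective; at diarization time, segments whose embeddings are close (e.g. by cosine distance) are clustered to the same speaker.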

Cited by 23 publications (22 citation statements)
References 35 publications
“…AMI: To evaluate the performance of the proposed embeddings on the speaker diarization task, we use a subset of the AMI meeting corpus [19] that is frequently used for evaluating diarization performance [20,21]. It consists of audio recordings from 26 meetings.…”
Section: Datasets
confidence: 99%
“…embeddings has been approached with different neural network architectures such as Siamese [6], fully connected [43], and CNN [24,25,10]. Only very recently have RNNs been used successfully [33,23,44].…”
Section: Related Work
confidence: 99%
“…Cyrta et al [33] suggest to learn the embeddings by training a recurrent convolutional neural network for the task of speaker classification. Although they utilize recurrent layers to retrieve temporal information, the feature extraction is still done by convolutional layers.…”
Section: Related Work
confidence: 99%
“…In [16], a conditional variational autoencoder is proposed in order to generate prosodic features for speech synthesis by sampling prosodic embeddings from the bottleneck representation. Specifically for separation tasks, speaker-discriminative embeddings are produced for targeted voice separation in [6] and for diarization in [17] yielding a significant improvement over the unconditional separation framework. Recent works [18,19] have utilized conditional embeddings for each music class in order to boost the performance of a deep attractor-network [20] for music separation.…”
Section: Introduction
confidence: 99%