2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2018
DOI: 10.1109/icassp.2018.8462628

Speaker Diarization with LSTM

Abstract: For many years, i-vector based audio embedding techniques were the dominant approach for speaker verification and speaker diarization applications. However, mirroring the rise of deep learning in various domains, neural network based audio embeddings, also known as d-vectors, have consistently demonstrated superior speaker verification performance. In this paper, we build on the success of d-vector based speaker verification systems to develop a new d-vector based approach to speaker diarization. Specifically, …

Citations: cited by 250 publications (194 citation statements)
References: 21 publications
“…al need to ask teachers to wear the LENA system during the entire teaching process and use differences in volume and pitch in order to assess when teachers were speaking or students were speaking. Please note that CAD is different from the classic speaker verification [14,15,16] and speaker diarization [17] where (1) there is no enrollment-verification 2-stage process in CAD tasks; and (2) not every speaker need to be identified.…”
Section: Related Work (mentioning)
confidence: 99%
“…We choose the standard scaled dot-product as our attention function [18]. The scaled dot-product scores can be viewed as the substitutes of the cosine similarities between voice embedding vectors, which have been commonly used as a calibration for the acoustic similarity between different speakers' utterances [14,17]. After that, we compute the multimodal representation by attending scores to contextual language features.…”
Section: Multimodal Attention Layer (mentioning)
confidence: 99%
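
The statement above compares scaled dot-product attention scores with cosine similarities between voice embeddings. A minimal sketch of the two scores may make the relationship concrete. The function names and the toy 4-dimensional vectors are assumptions made for illustration; they are not taken from the citing paper or from the cited one.

```python
import math


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Inner product divided by both vector norms; ignores vector length."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def scaled_dot_product(u, v):
    """Inner product divided by sqrt(d), where d is the embedding dimension."""
    return dot(u, v) / math.sqrt(len(u))


# Toy embeddings with made-up values (hypothetical, for illustration only).
a = [1.0, 0.0, 1.0, 0.0]
b = [2.0, 0.0, 2.0, 0.0]  # same direction as a, twice the length
c = [0.0, 1.0, 0.0, 1.0]  # orthogonal to a

cos_ab = cosine_similarity(a, b)
sdp_ab = scaled_dot_product(a, b)
cos_ac = cosine_similarity(a, c)
sdp_ac = scaled_dot_product(a, c)
```

Both scores are zero for orthogonal vectors and grow as the vectors align. The difference is that the scaled dot product also grows with vector magnitude, while cosine similarity does not, so the two coincide up to a constant factor only when the embeddings are length-normalized.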
“…To smooth and denoise the data, we employ the similarity matrix enhancement introduced in [6] with the Gaussian Blur step removed. This operation improves the system performance and the detailed procedure is listed as follows:…”
Section: Similarity Matrix Enhancement (mentioning)
confidence: 99%
“…Speaker embedding is a vector of fixed dimensionality that represents speaker's characteristics and can be extracted from a reference recording of speaker's speech. Speaker embeddings have been shown to be a useful source of information about speaker in many tasks, including speaker verification, speaker diarization [23], speech synthesis [24] and speech separation [12]. We condition ASR system for the recognition of speech of a certain speaker in the recording of overlapped speech.…”
Section: Speaker Embeddings (mentioning)
confidence: 99%