Interspeech 2020
DOI: 10.21437/interspeech.2020-1908
Self-Attentive Similarity Measurement Strategies in Speaker Diarization

Cited by 14 publications (11 citation statements) | References 16 publications
“…There are also several different approaches to generating the affinity matrix. In [152], a self-attention-based network was introduced to generate a similarity matrix directly from a sequence of speaker embeddings. In [153], several affinity matrices with different temporal resolutions were fused into a single affinity matrix by a neural network.…”
Section: Single-module Optimization 311 Speaker Clustering Enhanced B...
confidence: 99%
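As a rough illustration of the idea attributed to [152], the sketch below derives a pairwise similarity matrix from a sequence of speaker embeddings via scaled dot-product self-attention. This is a minimal NumPy sketch with random weights standing in for a trained network; all function and variable names are hypothetical, not the authors' implementation:

```python
import numpy as np

def self_attention_similarity(embeddings, d_k=None, seed=0):
    """Toy sketch: derive an N x N affinity matrix from N speaker
    embeddings via scaled dot-product self-attention.
    The projection weights are random stand-ins for trained parameters."""
    rng = np.random.default_rng(seed)
    n, d = embeddings.shape
    d_k = d_k or d
    w_q = rng.standard_normal((d, d_k)) / np.sqrt(d)
    w_k = rng.standard_normal((d, d_k)) / np.sqrt(d)
    q, k = embeddings @ w_q, embeddings @ w_k
    scores = q @ k.T / np.sqrt(d_k)
    # row-wise softmax yields attention weights, read off as similarities
    scores -= scores.max(axis=1, keepdims=True)
    att = np.exp(scores)
    att /= att.sum(axis=1, keepdims=True)
    # symmetrize so the result can feed a downstream clustering step
    return 0.5 * (att + att.T)

emb = np.random.default_rng(1).standard_normal((5, 16))
sim = self_attention_similarity(emb)
print(sim.shape)  # (5, 5)
```

The symmetrization at the end is one simple way to turn row-stochastic attention weights into an affinity matrix a clustering algorithm can consume.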
“…Considering that the CTS data differs substantially from the remaining non-conversational telephone speech (NCTS) 16 kHz audio, we build two different systems for CTS data and NCTS data. For NCTS data, we employ the system described in [6]. For CTS data, we first use AHC to determine the homogeneous speaker regions.…”
Section: Data Partition and Data Resources
confidence: 99%
“…For NCTS data, we employ an attention-based neural network to measure the similarity between two segments. The network architecture and training process are the same as the attentive vector-to-sequence (Att-v2s) scoring in [6]. The architecture of this transformer-based model consists of a multi-head self-attention module and several linear layers, as the figure shows.…”
Section: Similarity Measurement and Clustering
confidence: 99%
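The Att-v2s scoring described above can be caricatured as follows: a query speaker embedding attends to a sequence of segment embeddings through multi-head attention, and linear layers map the pooled context to a scalar similarity. This is an illustrative NumPy sketch with random placeholder weights and an assumed two-layer head, not the trained architecture or hyperparameters from [6]:

```python
import numpy as np

def att_v2s_score(query_emb, seq_embs, n_heads=2, seed=0):
    """Illustrative vector-to-sequence scorer: the query embedding
    attends to the segment sequence per head, heads are concatenated,
    and two linear layers produce a similarity logit.
    All weights are random placeholders for a trained model."""
    rng = np.random.default_rng(seed)
    d = query_emb.shape[0]
    assert d % n_heads == 0, "embedding dim must divide evenly by heads"
    dh = d // n_heads
    heads = []
    for _ in range(n_heads):
        w_q = rng.standard_normal((d, dh)) / np.sqrt(d)
        w_k = rng.standard_normal((d, dh)) / np.sqrt(d)
        w_v = rng.standard_normal((d, dh)) / np.sqrt(d)
        q = query_emb @ w_q                      # (dh,)
        k, v = seq_embs @ w_k, seq_embs @ w_v    # (T, dh) each
        s = k @ q / np.sqrt(dh)                  # (T,) attention logits
        s = np.exp(s - s.max()); s /= s.sum()    # softmax over the sequence
        heads.append(s @ v)                      # attention-pooled context
    ctx = np.concatenate(heads)                  # (d,)
    # two linear layers map [query; context] to a scalar similarity
    w1 = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
    w2 = rng.standard_normal(d) / np.sqrt(d)
    h = np.tanh(np.concatenate([query_emb, ctx]) @ w1)
    return float(h @ w2)
```

In a diarization pipeline, such a score would be computed between each segment and each candidate speaker region before clustering.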
“…Popular similarity measurements include cosine similarity [169] and PLDA-based similarity [167,179]. Recently, some deep-learning-based similarity measurements were also introduced, such as LSTM-based scoring [188], self-attentive similarity measurement strategies [189], and joint training of speaker embedding and PLDA scoring [166]. Common clustering algorithms include k-means [169], agglomerative hierarchical clustering [167], spectral clustering [188,169], Bayesian hidden Markov model based clustering [181,182,183], etc.…”
Section: Speaker Clustering
confidence: 99%
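To make the pairing of a similarity measurement with a clustering step concrete, here is a minimal sketch (NumPy only, hypothetical helper names) that builds a cosine affinity matrix from segment embeddings and performs a two-way spectral split via the Fiedler vector; real systems would use a full spectral or AHC implementation and estimate the number of speakers:

```python
import numpy as np

def cosine_affinity(embs):
    """Pairwise cosine similarity between row-wise segment embeddings."""
    x = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    return x @ x.T

def spectral_2way(affinity, temp=0.5):
    """Minimal 2-way spectral clustering: exponentiate cosine scores into
    positive edge weights, then split on the sign of the Fiedler vector
    (eigenvector of the 2nd-smallest Laplacian eigenvalue)."""
    w = np.exp(affinity / temp)
    lap = np.diag(w.sum(axis=1)) - w
    _, vecs = np.linalg.eigh(lap)   # eigenvalues in ascending order
    return (vecs[:, 1] > 0).astype(int)

# toy data: two well-separated "speakers", four segments each
rng = np.random.default_rng(0)
spk1 = np.eye(8)[0] + 0.05 * rng.standard_normal((4, 8))
spk2 = np.eye(8)[1] + 0.05 * rng.standard_normal((4, 8))
labels = spectral_2way(cosine_affinity(np.vstack([spk1, spk2])))
print(labels)
```

The exponential kernel keeps all edge weights positive so the graph Laplacian is well defined even when some cosine scores are negative.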