Interspeech 2018
DOI: 10.21437/interspeech.2018-2305

Triplet Network with Attention for Speaker Diarization

Abstract: In automatic speech processing systems, speaker diarization is a crucial front-end component to separate segments from different speakers. Inspired by the recent success of deep neural networks (DNNs) in semantic inferencing, triplet loss-based architectures have been successfully used for this problem. However, existing work utilizes conventional i-vectors as the input representation and builds simple fully connected networks for metric learning, thus not fully leveraging the modeling power of DNN architectures…

Cited by 12 publications (17 citation statements)
References 25 publications (51 reference statements)
“…Similar to the word-level representations, the sequential sentence representations are fused using either a naïve approach as described in Equation 4, a global attention approach as defined in Equations (8) to (11), or a contextual attention approach as detailed in Equations (15) to (17). This generates a single high-level representation h of all sentences in the whole interview.…”
Section: High-level Representationmentioning
confidence: 99%
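The global-attention fusion described in the quote above (score each sentence vector, softmax the scores, take the weighted sum to get a single representation h) can be sketched minimally as follows. This is an illustrative assumption, not the cited paper's exact Equations (8)–(11): `global_attention_fuse` and the dot-product scoring against a fixed `query` stand in for the learned attention parameters, which are not reproduced here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def global_attention_fuse(vectors, query):
    """Fuse a list of sentence vectors into one representation h:
    dot-product score against `query`, softmax to weights, weighted sum."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, vectors))
            for i in range(dim)]
```

With identical input vectors the attention weights are uniform, so the fused h equals the shared vector; distinct vectors are pulled toward the one most aligned with the query.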
“…Recently, attention mechanisms have been employed in a broad range of applications, such as acoustic scene classification [7], speaker diarisation [8], speech emotion recognition [9], image classification [10], video classification [11], and video description [12]. Attention mechanisms with linguistic information have also been used in document classification [13] and sentiment and self-assessed emotion detection [14] problems.…”
Section: Introductionmentioning
confidence: 99%
“…Portions of this work were performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. …for example cosine similarity between i-vectors, more recent solutions have emphasized the importance of integrating a metric learning pipeline into diarization systems [5,6,7]. This naturally allows knowledge inferred from an external data source to be utilized while performing diarization on unseen target data.…”
Section: Introductionmentioning
confidence: 99%
“…This amounts to inferring key factors in data, while encoding higher order interactions, to ensure that examples from the same speaker are within smaller distances, compared to examples from a different speaker [9]. While a variety of formulations exist for supervised metric learning [7,10], recent approaches have relied on deep networks to construct embeddings that satisfy the supervisory constraints. Popular examples include the siamese [11], triplet [12,9], and quadruplet [13] networks.…”
Section: Introductionmentioning
confidence: 99%
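The metric-learning constraint described above (same-speaker examples closer together than different-speaker examples, by some margin) is exactly what the triplet loss enforces. A minimal stdlib-only sketch, assuming Euclidean distance and a margin of 0.2; the embeddings and variable names are illustrative, not taken from the paper:

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that is zero once the anchor-negative distance exceeds
    the anchor-positive distance by at least `margin`."""
    return max(0.0, euclidean(anchor, positive)
                    - euclidean(anchor, negative) + margin)

# Toy speaker embeddings: anchor and positive from one speaker,
# negative from another.
a = [0.1, 0.9]
p = [0.2, 0.8]   # same speaker as the anchor
n = [0.9, 0.1]   # different speaker
print(triplet_loss(a, p, n))  # → 0.0 (constraint already satisfied)
```

When the constraint is violated (e.g. swapping `p` and `n`), the loss is positive, and gradient descent on it pulls same-speaker embeddings together and pushes different-speaker embeddings apart. Siamese and quadruplet networks mentioned in the quote apply the same idea with pairs and quadruples instead of triples.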