Interspeech 2018
DOI: 10.21437/interspeech.2018-2305

Triplet Network with Attention for Speaker Diarization

Abstract: In automatic speech processing systems, speaker diarization is a crucial front-end component to separate segments from different speakers. Inspired by the recent success of deep neural networks (DNNs) in semantic inferencing, triplet loss-based architectures have been successfully used for this problem. However, existing work utilizes conventional i-vectors as the input representation and builds simple fully connected networks for metric learning, thus not fully leveraging the modeling power of DNN architectures…

Cited by 12 publications (17 citation statements)
References 25 publications (51 reference statements)
“…Similar to the word-level representations, the sequential sentence representations are fused using either a naïve approach as described in Equation 4, a global attention approach as defined in Equations (8) to (11), or a contextual attention approach as detailed in Equations (15) to (17). This generates a single high-level representation h of all sentences in the whole interview.…”
Section: High-level Representationmentioning
confidence: 99%
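The global-attention fusion described in the quote above (score each sentence vector, softmax the scores, take the weighted sum to get a single representation h) can be sketched minimally as follows. This is an illustrative assumption, not the cited paper's exact Equations (8)–(11): `global_attention_fuse` and the dot-product scoring against a fixed `query` stand in for the learned attention parameters, which are not reproduced here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def global_attention_fuse(vectors, query):
    """Fuse a list of sentence vectors into one representation h:
    dot-product score against `query`, softmax to weights, weighted sum."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, vectors))
            for i in range(dim)]
```

With identical input vectors the attention weights are uniform, so the fused h equals the shared vector; distinct vectors are pulled toward the one most aligned with the query.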
“…Recently, attention mechanisms have been employed in a broad range of applications, such as acoustic scene classification [7], speaker diarisation [8], speech emotion recognition [9], image classification [10], video classification [11], and video description [12]. Attention mechanisms with linguistic information have also been used in document classification [13] and sentiment and self-assessed emotion detection [14] problems.…”
Section: Introductionmentioning
confidence: 99%
“…Portions of this work were performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. …for example cosine similarity between i-vectors, more recent solutions have emphasized the importance of integrating a metric learning pipeline into diarization systems [5,6,7]. This naturally allows knowledge inferred from an external data source to be utilized while performing diarization on unseen target data.…”
Section: Introductionmentioning
confidence: 99%
“…This amounts to inferring key factors in data, while encoding higher order interactions, to ensure that examples from the same speaker are within smaller distances, compared to examples from a different speaker [9]. While a variety of formulations exist for supervised metric learning [7,10], recent approaches have relied on deep networks to construct embeddings that satisfy the supervisory constraints. Popular examples include the siamese [11], triplet [12,9], and quadruplet [13] networks.…”
Section: Introductionmentioning
confidence: 99%
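The metric-learning constraint described above (same-speaker examples closer together than different-speaker examples, by some margin) is exactly what the triplet loss enforces. A minimal stdlib-only sketch, assuming Euclidean distance and a margin of 0.2; the embeddings and variable names are illustrative, not taken from the paper:

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that is zero once the anchor-negative distance exceeds
    the anchor-positive distance by at least `margin`."""
    return max(0.0, euclidean(anchor, positive)
                    - euclidean(anchor, negative) + margin)

# Toy speaker embeddings: anchor and positive from one speaker,
# negative from another.
a = [0.1, 0.9]
p = [0.2, 0.8]   # same speaker as the anchor
n = [0.9, 0.1]   # different speaker
print(triplet_loss(a, p, n))  # → 0.0 (constraint already satisfied)
```

When the constraint is violated (e.g. swapping `p` and `n`), the loss is positive, and gradient descent on it pulls same-speaker embeddings together and pushes different-speaker embeddings apart. Siamese and quadruplet networks mentioned in the quote apply the same idea with pairs and quadruples instead of triples.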