2021
DOI: 10.48550/arxiv.2101.09624
Preprint

A Review of Speaker Diarization: Recent Advances with Deep Learning

Abstract: Speaker diarization is a task to label audio or video recordings with classes corresponding to speaker identity, or in short, a task to identify "who spoke when". In the early years, speaker diarization algorithms were developed for speech recognition on multispeaker audio recordings to enable speaker adaptive processing, but over time the task also gained its own value as a stand-alone application that provides speaker-specific meta information for downstream tasks such as audio retrieval. More recently, with the rise of…

Cited by 12 publications (21 citation statements)
References 175 publications
“…EEND-EDA uses a Transformer encoder [29] without positional encodings (Figure 1a) for Encoder in (2). Given E^in ∈ R^{D×T}, the encoder converts it into E^out ∈ R^{D×T} as follows:…”
Section: Transformer Encoder
confidence: 99%
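The quoted encoder maps a D×T embedding matrix to another D×T matrix without adding positional encodings. A minimal NumPy sketch of one self-attention layer (the weights Wq, Wk, Wv and the values of D and T are illustrative, not EEND-EDA's actual configuration) shows the consequence of that choice: the layer is permutation-equivariant over frames, so the model carries no notion of absolute frame position.

```python
import numpy as np

rng = np.random.default_rng(0)
D, T = 8, 5  # hypothetical feature dimension and number of frames

def self_attention(E, Wq, Wk, Wv):
    """One self-attention layer over E, with frames as columns (D x T)."""
    Q, K, V = Wq @ E, Wk @ E, Wv @ E
    scores = (Q.T @ K) / np.sqrt(D)           # (T, T) frame-to-frame scores
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)         # row-wise softmax over frames
    return V @ A.T                            # (D, T) output embeddings

Wq, Wk, Wv = (rng.standard_normal((D, D)) for _ in range(3))
E_in = rng.standard_normal((D, T))
E_out = self_attention(E_in, Wq, Wk, Wv)

# With no positional encodings, permuting the input frames permutes the
# output frames identically (permutation equivariance).
perm = rng.permutation(T)
assert np.allclose(self_attention(E_in[:, perm], Wq, Wk, Wv), E_out[:, perm])
```

A full Transformer encoder stacks several such layers with feed-forward sublayers, layer normalization, and residual connections, but the equivariance argument is unchanged.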
“…Meeting transcription is one of the largest application areas of speech-related technologies. One important component of meeting transcription is speaker diarization [1,2], which gives speaker attributes to each transcribed utterance. In recent years, many end-to-end diarization methods have been proposed [3,4,5,6], achieving accuracy comparable to that of modular-based methods [7,8].…”
Section: Introduction
confidence: 99%
“…Following the serialized output training (SOT) framework [28], a multi-talker transcription is represented as a single sequence Y by concatenating the word sequences of the individual speakers with a special "speaker change" symbol ⟨sc⟩. For example, the reference token sequence to Y for the three-speaker case is given as R = {r^1_1, …, r^1_{N^1}, ⟨sc⟩, r^2_1, …, r^2_{N^2}, ⟨sc⟩, r^3_1, …, r^3_{N^3}, ⟨eos⟩}, where r^j_i represents the i-th token of the j-th speaker. A special symbol ⟨eos⟩ is inserted at the end of all transcriptions to determine the termination of inference.…”
Section: Overview
confidence: 99%
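The SOT serialization described in this quote can be sketched in a few lines. The `<sc>`/`<eos>` strings and the toy per-speaker token lists below are illustrative placeholders, not the cited system's actual vocabulary.

```python
SC, EOS = "<sc>", "<eos>"  # speaker-change and end-of-sequence symbols

def serialize(speaker_tokens):
    """Concatenate per-speaker token lists into one reference sequence R."""
    R = []
    for j, tokens in enumerate(speaker_tokens):
        if j > 0:
            R.append(SC)   # boundary symbol between consecutive speakers
        R.extend(tokens)
    R.append(EOS)          # single terminator after all transcriptions
    return R

R = serialize([["hello", "there"], ["hi"], ["good", "morning"]])
print(R)
# ['hello', 'there', '<sc>', 'hi', '<sc>', 'good', 'morning', '<eos>']
```

Note that only one ⟨eos⟩ is emitted, matching the quote: it terminates the whole serialized sequence, not each speaker's portion.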
“…Speaker diarization is a task of recognizing "who spoke when" from audio recordings [1]. A conventional approach is based on speaker embedding extraction for short segmented audio, followed by clustering of the embeddings (sometimes with some constraint regarding the speaker transitions) to attribute the speaker identity to each short segment.…”
Section: Introduction
confidence: 99%
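The conventional pipeline in this quote (a speaker embedding per short segment, then clustering to assign speaker identities) can be illustrated with synthetic embeddings. The greedy cosine-similarity clustering below is a simplified stand-in for methods such as agglomerative hierarchical clustering, not the cited approach itself, and the embeddings are toy data rather than outputs of a real extractor.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two synthetic "speakers": orthogonal unit centroids plus small noise,
# one embedding per short audio segment.
centroids = np.vstack([np.eye(16)[0], np.eye(16)[1]])
true_speakers = [0, 0, 1, 0, 1, 1]
segments = np.vstack([centroids[s] + 0.05 * rng.standard_normal(16)
                      for s in true_speakers])

def cluster_by_cosine(X, threshold=0.9):
    """Greedy leader clustering on cosine similarity over segment embeddings."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    labels, leaders = [], []
    for x in X:
        sims = [x @ leader for leader in leaders]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))   # join the closest cluster
        else:
            leaders.append(x)                     # start a new speaker cluster
            labels.append(len(leaders) - 1)
    return labels

print(cluster_by_cosine(segments))  # → [0, 0, 1, 0, 1, 1]
```

The threshold plays the role of the stopping criterion in real clustering-based diarization; constraints on speaker transitions, mentioned in the quote, would be applied on top of such segment-level labels.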
“…obtaining these speaker labels manually on a large dataset would be expensive and time-consuming). A common approach is to use speaker diarization to classify the speakers in the audio [9,10]. Although the ATCO speech is often cleaner than the pilot's (as the former communicates from a controlled acoustic environment), the speech recordings collected in the ATCO2 project using Very High Frequency (VHF) receivers are noisy for both ATCO and pilot channels.…”
Section: Introduction
confidence: 99%