A Review of Speaker Diarization: Recent Advances with Deep Learning

Park, Tae‐Jin; Kanda, Naoyuki; Dimitriadis, Dimitrios; Han, Ki Jin; Watanabe, Soichi; Narayanan, Shrikanth

doi:10.48550/arxiv.2101.09624

Cited by 12 publications

(21 citation statements)

References 175 publications

“…EEND-EDA uses a Transformer encoder [29] without positional encodings (Figure 1a) for Encoder in (2). Given $E_{\mathrm{in}} \in \mathbb{R}^{D \times T}$, the encoder converts it into $E_{\mathrm{out}} \in \mathbb{R}^{D \times T}$ as follows:…”

Section: Transformer Encoder (mentioning)

confidence: 99%

“…Meeting transcription is one of the largest application areas of speech-related technologies. One important component of meeting transcription is speaker diarization [1,2], which gives speaker attributes to each transcribed utterance. In recent years, many end-to-end diarization methods have been proposed [3,4,5,6] and have achieved comparative accuracy to that of modular-based methods [7,8].…”

Section: Introduction (mentioning)

confidence: 99%

Multi-Channel End-to-End Neural Diarization with Distributed Microphones

Horiguchi, Takashima, Garcia et al. 2021

Preprint

Self Cite

Recent progress on end-to-end neural diarization (EEND) has enabled overlap-aware speaker diarization with a single neural network. This paper proposes to enhance EEND by using multi-channel signals from distributed microphones. We replace Transformer encoders in EEND with two types of encoders that process a multichannel input: spatio-temporal and co-attention encoders. Both are independent of the number and geometry of microphones and suitable for distributed microphone settings. We also propose a model adaptation method using only single-channel recordings. With simulated and real-recorded datasets, we demonstrated that the proposed method outperformed conventional EEND when a multi-channel input was given while maintaining comparable performance with a single-channel input. We also showed that the proposed method performed well even when spatial information is inoperative given multi-channel inputs, such as in hybrid meetings in which the utterances of multiple remote participants are played back from the same loudspeaker.

“…Following the serialized output training (SOT) framework [28], a multi-talker transcription is represented as a single sequence Y by concatenating the word sequences of the individual speakers with a special "speaker change" symbol ⟨sc⟩. For example, the reference token sequence to Y for the three-speaker case is given as $R = \{r^1_1, \ldots, r^1_{N^1}, \langle sc \rangle, r^2_1, \ldots, r^2_{N^2}, \langle sc \rangle, r^3_1, \ldots, r^3_{N^3}, \langle eos \rangle\}$, where $r^j_i$ represents the i-th token of the j-th speaker. A special symbol ⟨eos⟩ is inserted at the end of all transcriptions to determine the termination of inference.…”

Section: Overview (mentioning)

confidence: 99%
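
The serialization described in the SOT statement above can be illustrated with a minimal sketch. The token strings `<sc>` and `<eos>`, the function name `serialize_transcripts`, and the example words are assumptions made for illustration; the quoted passage does not specify how the symbols are encoded or how speakers are ordered, so the speakers are simply taken in the order given.

```python
from typing import List

# Hypothetical spellings of the special symbols; the quoted text only
# names them "speaker change" (sc) and end-of-sequence (eos).
SC = "<sc>"
EOS = "<eos>"


def serialize_transcripts(speaker_tokens: List[List[str]]) -> List[str]:
    """Concatenate per-speaker token lists into one reference sequence.

    A speaker-change symbol is placed between consecutive speakers and a
    single end-of-sequence symbol closes the whole sequence.
    """
    serialized: List[str] = []
    for index, tokens in enumerate(speaker_tokens):
        if index > 0:
            serialized.append(SC)  # boundary between two speakers
        serialized.extend(tokens)
    serialized.append(EOS)  # terminates the full transcription
    return serialized


# Three-speaker example with made-up tokens.
reference = serialize_transcripts(
    [["hello", "there"], ["good", "morning"], ["hi"]]
)
print(reference)
```

With J speakers the result contains J - 1 speaker-change symbols and exactly one end-of-sequence symbol, matching the three-speaker sequence in the quoted statement.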

“…Speaker diarization is a task of recognizing "who spoke when" from audio recordings [1]. A conventional approach is based on speaker embedding extraction for short segmented audio, followed by clustering of the embeddings (sometimes with some constraint regarding the speaker transitions) to attribute the speaker identity to each short segment.…”

Section: Introduction (mentioning)

confidence: 99%

Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End-to-End Speaker-Attributed ASR

Kanda, Xiong, Gaur et al. 2021

Preprint

Self Cite

This paper presents Transcribe-to-Diarize, a new approach for neural speaker diarization that uses an end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR). The E2E SA-ASR is a joint model that was recently proposed for speaker counting, multi-talker speech recognition, and speaker identification from monaural audio that contains overlapping speech. Although the E2E SA-ASR model originally does not estimate any time-related information, we show that the start and end times of each word can be estimated with sufficient accuracy from the internal state of the E2E SA-ASR by adding a small number of learnable parameters. Similar to the target-speaker voice activity detection (TS-VAD)-based diarization method, the E2E SA-ASR model is applied to estimate speech activity of each speaker while it has the advantages of (i) handling unlimited number of speakers, (ii) leveraging linguistic information for speaker diarization, and (iii) simultaneously generating speaker-attributed transcriptions. Experimental results on the LibriCSS and AMI corpora show that the proposed method achieves significantly better diarization error rate than various existing speaker diarization methods when the number of speakers is unknown, and achieves a comparable performance to TS-VAD when the number of speakers is given in advance. The proposed method simultaneously generates speaker-attributed transcription with state-of-the-art accuracy.

“…obtaining these speaker labels manually on a large dataset would be expensive and time consuming). A common approach is to use speaker diarization to classify the speakers in the audio [9,10]. Although the ATCO speech is often cleaner than the pilot (as the former communicates from a controlled acoustic environment), the speech recordings collected in ATCO2 project using Very High Frequency (VHF) receivers are noisy for both ATCO and pilot channels.…”

Section: Introduction (mentioning)

confidence: 99%

Grammar Based Speaker Role Identification for Air Traffic Control Speech Recognition

Prasad, Zuluaga-Gómez, Motlíček et al. 2021

Preprint

Assistant Based Speech Recognition (ABSR) for air traffic control is generally trained by pooling both Air Traffic Controller (ATCO) and pilot data. In practice, this is motivated by the fact that the proportion of pilot data is lesser compared to ATCO while their standard language of communication is similar. However, due to data imbalance of ATCO and pilot and their varying acoustic conditions, the ASR performance is usually significantly better for ATCOs than pilots. In this paper, we propose to (1) split the ATCO and pilot data using an automatic approach exploiting ASR transcripts, and (2) consider ATCO and pilot ASR as two separate tasks for Acoustic Model (AM) training. For speaker role classification of ATCO and pilot data, a hypothesized ASR transcript is generated with a seed model, subsequently used to classify the speaker role based on the knowledge extracted from grammar defined by International Civil Aviation Organization (ICAO). This approach provides an average speaker role identification accuracy of 83% for ATCO and pilot. Finally, we show that training AMs separately for each task, or using a multitask approach is well suited for this data compared to AM trained by pooling all data.
