Interspeech 2021
DOI: 10.21437/interspeech.2021-1909
End-to-End Neural Diarization: From Transformer to Conformer

Abstract: We propose a new end-to-end neural diarization (EEND) system that is based on Conformer, a recently proposed neural architecture that combines convolutional mappings and Transformer to model both local and global dependencies in speech. We first show that data augmentation and convolutional subsampling layers enhance the original self-attentive EEND in the Transformer-based EEND, and then Conformer gives an additional gain over the Transformer-based EEND. However, we notice that the Conformer-based EEND does n…

Cited by 26 publications (34 citation statements)
References 20 publications
“…End-to-end neural diarization (EEND) [7][8][9] is a method for estimating the speech activities of each speaker from a multiple speaker input mixture using an end-to-end neural network. Given a T -length sequence of F -dimensional acoustic features…”
Section: Background Methods (mentioning)
confidence: 99%
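The excerpt above characterizes EEND as mapping a T-length sequence of F-dimensional acoustic features to per-frame, per-speaker speech activities. A minimal shape-level sketch, assuming a T x F input and S speakers (the random projection is a hypothetical stand-in for the trained self-attention encoder, included only to make the shapes concrete):

```python
import math
import random

def eend_sketch(features, num_speakers, seed=0):
    """Toy stand-in for an EEND network: maps a T x F feature
    sequence to per-frame, per-speaker speech-activity probabilities.
    A real EEND model would use stacked self-attention (or Conformer)
    encoder layers here; this sketch uses a random linear projection
    purely to illustrate the input/output mapping T x F -> T x S."""
    rng = random.Random(seed)
    F = len(features[0])
    # Hypothetical projection weights standing in for the trained encoder.
    W = [[rng.gauss(0.0, 0.1) for _ in range(num_speakers)] for _ in range(F)]
    probs = []
    for frame in features:
        logits = [sum(frame[f] * W[f][s] for f in range(F))
                  for s in range(num_speakers)]
        # Per-speaker sigmoid: activities are estimated independently,
        # so overlapped speech (several speakers active in one frame)
        # is representable, unlike in clustering-based diarization.
        probs.append([1.0 / (1.0 + math.exp(-z)) for z in logits])
    return probs  # shape: T x S

# T=5 frames of F=3 features, S=2 speakers
out = eend_sketch([[0.1, 0.2, 0.3]] * 5, num_speakers=2)
```

The per-speaker sigmoid output (rather than a single softmax over speakers) is what lets EEND handle speaker overlap directly.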
“…Hence, they have a problem with realistic data with speaker overlaps. Alternatively, fully end-to-end neural diarization (EEND) [7][8][9] systems can handle speaker overlap by training with the speaker overlap data. One drawback of EEND is the number of speakers has to be known and fixed beforehand.…”
Section: Introduction (mentioning)
confidence: 99%
“…Our baseline model is a Conformer-based EEND model described in [13]. The acoustic input to the model XAcoustic is subsampled by a factor of 4 with a 2-D convolutional layer.…”
Section: Conformer-based Diarization (mentioning)
confidence: 99%
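The excerpt above states only an overall subsampling factor of 4 from a 2-D convolutional layer. A small sketch of how the time-axis length shrinks under stacked strided convolutions (the kernel size, stride, and layer count are assumptions, e.g. two stride-2 layers; the excerpt specifies just the overall factor):

```python
def conv_subsample_length(T, kernel=3, stride=2, layers=2):
    """Output length along the time axis after `layers` stacked strided
    convolutions with no padding. Two stride-2 layers give roughly T/4,
    matching the overall subsampling factor of 4 stated in the excerpt."""
    for _ in range(layers):
        T = (T - kernel) // stride + 1
    return T

# A 1000-frame input shrinks to roughly 1000 / 4 frames.
reduced = conv_subsample_length(1000)
```

Subsampling before the encoder shortens the sequence the self-attention layers must process, which cuts the quadratic attention cost.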
“…Four Conformer layers of size 256 with 4 attention heads are used in the encoder. No positional encodings are used based on the finding in [13]. All models are trained with a batch size of 192 with the Adam optimizer, with the learning rate scheme described in [23] for 200 epochs.…”
Section: Data and Experiments (mentioning)
confidence: 99%
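The hyperparameters quoted above can be collected into a configuration sketch. The field names are illustrative; the values are only those stated in the excerpt, and the learning-rate scheme of [23] is left unspecified rather than guessed:

```python
# Hypothetical config layout; values are from the quoted excerpt.
conformer_eend_config = {
    "encoder_type": "conformer",
    "encoder_layers": 4,
    "hidden_size": 256,
    "attention_heads": 4,
    "positional_encoding": None,  # omitted, per the finding in [13]
    "batch_size": 192,
    "optimizer": "adam",
    "lr_schedule": "see [23]",    # scheme not detailed in the excerpt
    "epochs": 200,
}
```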
“…Alternatively, the end-to-end approach for speaker diarization is gaining attention due to its simple architecture and promising results compared to the conventional cascaded systems [17,18,19,20]. In this approach, diarization models are designed to estimate each speaker's speech activities from an input multi-speaker conversational recording.…”
Section: Introduction (mentioning)
confidence: 99%