CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings

Watanabe, Shinji; Mandel, Michael I.; Barker, Jon; Vincent, Emmanuel; Arora, Ashish; Chang, Xuankai; Khudanpur, Sanjeev; Manohar, Vimal; Povey, Daniel; Raj, Desh; Snyder, David; Subramanian, Arvind; Trmal, Jan; Yair, Bar Ben; Boeddeker, Christoph; Ni, Zhaoheng; Fujita, Yusuke; Horiguchi, Shota; Kanda, Naoyuki; Yoshioka, Takuya; Ryant, Neville

doi:10.21437/chime.2020-1

Cited by 149 publications

(110 citation statements)

References 0 publications

Citation types: Supporting, Mentioning (108), Contrasting


“…It also seems interesting to be able to extend this type of study to new techniques proposed in the field of voice recognition, in which ideas based on the study of samples obtained in real noisy environments such as social gatherings, streets, cafes and restaurants are raised [108]. Likewise, an interesting line to take into account in this regard is given by the current challenges posed to address a voice recognition scenario capable of providing speech enhancement, speaker diarization and speech recognition modules, for example, by means of recognition modules based on multispeaker speech recognition for unsegmented recordings [109].…”

Section: Discussion (mentioning)

confidence: 99%

Using a Human Interviewer or an Automatic Interviewer in the Evaluation of Patients with AD from Speech

et al. 2021


Currently, there are more and more frequent studies focused on the evaluation of Alzheimer’s disease (AD) from the automatic analysis of the speech of patients, in order to detect the presence of the disease in an individual or for the evolutionary control of the disease. However, studies focused on analyzing the effect of the methodology used to generate the spontaneous speech of the speaker who undergoes this type of analysis are rare. The objective of this work is to study two different strategies to facilitate the generation of the spontaneous speech of a speaker for further analysis: the use of a human interviewer that promotes the generation of speech through an interview and the use of an automatic system (an automatic interviewer) that invites the speaker to describe certain visual stimuli. In this study, a database called Cross-Sectional Alzheimer Prognosis R2019 has been created, consisting of speech samples from speakers recorded using both methodologies. The speech recordings have been studied through a feature extraction based on five basic temporal measurements. This study demonstrates the discriminatory capacity between the speakers with AD and the control subjects independent of the strategy used in the generation of spontaneous speech. These results are promising and can serve as a basis for knowing the effectiveness and extension of automated interview processes, especially in telemedicine and telecare scenarios.


“…In spontaneous human conversations different speakers tend to overlap with each other and, in meeting scenarios with more than two participants, the amount of overlapped speech can account for a significant portion of the total speech time, usually between 10% and 20% (McCowan et al, 2005;Watanabe et al, 2020). This phenomenon is one of the main obstacles towards fully reliable multi-party speech diarization (Ryant et al, 2018;García-Perera et al, 2020) and recognition (Watanabe et al, 2017;Vincent et al, 2018;Haeb-Umbach et al, 2019).…”

Section: Motivation (mentioning)

confidence: 99%

“…For this reason, Overlapped Speech Detection (OSD) is crucial to prevent back-end task performance degradation. This can be accomplished by including a reliable OSD algorithm together with Voice Activity Detection (VAD) in the very front-end part of the pipeline, possibly followed by speech separation (García-Perera et al, 2020;Watanabe et al, 2020). Speaker counting (Stöter et al, 2019) is a closely related task, which can be seen as an extension of VAD+OSD.…”

Section: Motivation (mentioning)

confidence: 99%

“…In a similar manner also the recently proposed Target-Speaker VAD (TS-VAD) framework of Medennikov et al (2020) can be employed for the purpose of counting speakers. This technique has been shown to be particularly effective even in challenging scenarios such as CHiME-6 (Watanabe et al, 2020). TS-VAD employs a neural network to estimate each speaker speech activity.…”

Section: Related Work (mentioning)

confidence: 99%


Overlapped Speech Detection and speaker counting using distant microphone arrays

Cornell, Omologo, Squartini 2022

Computer Speech & Language

Self Cite


We study the problem of detecting and counting simultaneous, overlapping speakers in a multichannel, distant-microphone scenario. Focusing on a supervised learning approach, we treat Voice Activity Detection (VAD), Overlapped Speech Detection (OSD), joint VAD and OSD (VAD+OSD) and speaker counting in a unified way, as instances of a general Overlapped Speech Detection and Counting (OSDC) multi-class supervised learning problem. We consider a Temporal Convolutional Network (TCN) and a Transformer-based architecture for this task, and compare them with previously proposed state-of-the-art methods based on Recurrent Neural Networks (RNN) or hybrid Convolutional-Recurrent Neural Networks (CRNN). In addition, we propose ways of exploiting multichannel input by means of early or late fusion of single-channel features with spatial features extracted from one or more microphone pairs. We conduct an extensive experimental evaluation on the AMI and CHiME-6 datasets and on a purposely made multichannel synthetic dataset. We show that the Transformer-based architecture performs best among all architectures and that neural-network-based spatial localization features outperform signal-based spatial features and significantly improve performance compared to single-channel features only. Finally, we find that training with a speaker counting objective improves OSD compared to training with a VAD+OSD objective.


“…Speaker diarization has attracted attention because it can be used to boost the performance of ASR [23]. Motivated by the CHiME Challenges [24,25] and the DIHARD Challenges [26,27], several researchers have worked on developing more advanced speaker diarization system. Lin et al proposed a long short-term memory (LSTM)-based similarity measurement for the clustering-based speaker diarization.…”

Section: Related Work (mentioning)

confidence: 99%

End-to-End Speaker Diarization Conditioned on Speech Activity and Overlap Detection

Takashima, Fujita, Watanabe, et al. 2021

2021 IEEE Spoken Language Technology Workshop (SLT)

Self Cite


In this paper, we present a conditional multitask learning method for end-to-end neural speaker diarization (EEND). The EEND system has shown promising performance compared with traditional clustering-based methods, especially in the case of overlapping speech. In this paper, to further improve the performance of the EEND system, we propose a novel multitask learning framework that solves speaker diarization and a desired subtask while explicitly considering the task dependency. We optimize speaker diarization conditioned on speech activity and overlap detection that are subtasks of speaker diarization, based on the probabilistic chain rule. Experimental results show that our proposed method can leverage a subtask to effectively model speaker diarization, and outperforms conventional EEND systems in terms of diarization error rate.
