Continuous Speech Separation: Dataset and Analysis

Chen, Zhuo; Yoshioka, Takuya; Lu, Liang; Zhou, Tianyan; Meng, Ziyang; Luo, Yi; Wu, Jian; Xiao, Xiong; Li, Jinyu

doi:10.1109/icassp40776.2020.9053426

Cited by 156 publications

(161 citation statements)

References 45 publications

Supporting

Mentioning

159

Contrasting


“…In the second experiment, we used the meeting-like LibriCSS corpus [18], which consists of 8-speaker meeting-like recordings sessions of 10 minutes, obtained by re-recording LibriSpeech utterances played through loudspeakers in a meeting room. The overlap ratio varies from 0 to 40 %.…”

Section: Dataset (mentioning)

confidence: 99%


Speaker Activity Driven Neural Speech Extraction

Delcroix, Žmolíková, Ochiai, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Target speech extraction, which extracts the speech of a target speaker in a mixture given auxiliary speaker clues, has recently received increased interest. Various clues have been investigated such as pre-recorded enrollment utterances, direction information, or video of the target speaker. In this paper, we explore the use of speaker activity information as an auxiliary clue for single-channel neural network-based speech extraction. We propose a speaker activity driven speech extraction neural network (ADEnet) and show that it can achieve performance levels competitive with enrollment-based approaches, without the need for pre-recordings. We further demonstrate the potential of the proposed approach for processing meeting-like recordings, where the speaker activity is obtained from a diarization system. We show that this simple yet practical approach can successfully extract speakers after diarization, which results in improved ASR performance, especially in high overlapping conditions, with a relative word error rate reduction of up to 25 %.


“…We discuss related works in Section 4. In Section 5, we present experimental results based on the LibriCSS corpus [18]. Finally, we conclude the paper in Section 6.…”

Section: Introduction (mentioning)

confidence: 99%

Speaker Activity Driven Neural Speech Extraction (Delcroix, Žmolíková, Ochiai, et al. 2021; ICASSP 2021)

“…It contains 10 hours of audio recordings in regular meeting rooms. Each mini-session [footnote 1: Readers can refer to [23] to get more details.] in…”

Section: Dataset (mentioning)

confidence: 99%

“…All our models in the table use the window size of 2.4s. 0S/L[23]: 0% overlap ratio with short/long silence.…”

mentioning

confidence: 99%

Dual-Path Modeling for Long Recording Speech Separation in Meetings

Chen, Luo, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

The continuous speech separation (CSS) is a task to separate the speech sources from a long, partially overlapped recording, which involves a varying number of speakers. A straightforward extension of conventional utterance-level speech separation to the CSS task is to segment the long recording with a size-fixed window and process each window separately. Though effective, this extension fails to model the long dependency in speech and thus leads to sub-optimum performance. The recent proposed dual-path modeling could be a remedy to this problem, thanks to its capability in jointly modeling the cross-window dependency and the local-window processing. In this work, we further extend the dual-path modeling framework for CSS task. A transformer-based dual-path system is proposed, which integrates transform layers for global modeling. The proposed models are applied to LibriCSS, a real recorded multi-talk dataset, and consistent WER reduction can be observed in the ASR evaluation for separated speech. Also, a dual-path transformer equipped with convolutional layers is proposed. It significantly reduces the computation amount by 30% with better WER evaluation. Furthermore, the online processing dual-path models are investigated, which shows 10% relative WER reduction compared to the baseline.
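The fixed-window baseline described in this abstract (segment the long recording, then process each window separately) can be illustrated with a minimal sketch. It is not the authors' implementation. The 2.4 s window size is taken from the citation statement above; the 1.2 s hop, the zero-padding of the final window, and the names `split_into_windows`, `process_windows_independently`, and `separate_fn` are assumptions made only for illustration.

```python
import numpy as np


def split_into_windows(signal, sample_rate, window_s=2.4, hop_s=1.2):
    """Cut a 1-D signal into fixed-size windows.

    window_s=2.4 follows the cited statement; hop_s=1.2 is an assumed value.
    The signal is zero-padded so the last window has full length.
    """
    win = int(round(window_s * sample_rate))
    hop = int(round(hop_s * sample_rate))
    if win <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")

    n = len(signal)
    # Number of windows needed to cover every sample at least once.
    n_windows = 1 if n <= win else 1 + int(np.ceil((n - win) / hop))

    padded = np.zeros(win + (n_windows - 1) * hop, dtype=signal.dtype)
    padded[:n] = signal
    return np.stack([padded[i * hop : i * hop + win] for i in range(n_windows)])


def process_windows_independently(windows, separate_fn):
    """Apply a (hypothetical) separation function to each window in isolation.

    No information is shared between windows, which is the limitation
    the abstract attributes to this straightforward extension.
    """
    return [separate_fn(w) for w in windows]
```

Because each call to `separate_fn` sees only one window, nothing learned in one window can inform the next; the dual-path models in the cited paper are proposed to address that cross-window dependency.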


“…However, the microphone array in this dataset is only a circular array and cannot be changed, so it cannot be applied to the scenes that require a specific shape of the microphone array. Chen et al proposed a dataset [14] for evaluating continuous speech separation. In this dataset, the speech signal is continuous, containing both the overlapped and overlap-free components.…”

Section: Introduction (mentioning)

confidence: 99%

MASS: Microphone Array Speech Simulator in Room Acoustic Environment for Multi-Channel Speech Coding and Enhancement

2020


Multi-channel speech coding and enhancement is an indispensable technology in speech communication. In order to verify the effectiveness of multi-channel speech coding and enhancement methods in the research and development, a microphone array speech simulator (MASS) used in room acoustic environment is proposed. The proposed MASS is the improvement and extension of the existing multi-channel speech simulator. It aims to simulate clean speech, noisy speech, clean speech with reverberation, noisy speech with reverberation, and noise signals by microphone array used for multi-channel coding and enhancement of speech signal in room acoustic environment. The experimental results of the multi-channel speech coding and enhancement prove that the MASS could well simulate the signals used in real room acoustic environment and can be applied to the research of the related fields.

