ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053697
Interrupted and Cascaded Permutation Invariant Training for Speech Separation

Abstract: Permutation Invariant Training (PIT) has long been a stepping-stone method for training speech separation models in handling the label ambiguity problem. With PIT selecting the minimum-cost label assignments dynamically, very few studies have considered the separation problem to be one of optimizing both the model parameters and the label assignments; most have instead focused on searching for good model architectures and parameters. In this paper, we investigate instead, for a given model architecture, the various flexible label assignment…

Cited by 8 publications (11 citation statements). References 23 publications.
“…Our SI-SNRi results (16.5 and 17.5 dB) are promising. Further, we also confirmed the reduction of permutation errors and generalization of improvements to other test sets, which was not tested in [7,8].…”
Section: Additional Discussion: Previous Results on WSJ0-2mix (supporting)
confidence: 70%
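For reference, the SI-SNRi figures quoted above measure scale-invariant SNR improvement over the unprocessed mixture. Below is a minimal sketch of that metric, assuming 1-D NumPy signals; it is an illustration, not the cited work's exact evaluation code.

```python
# Sketch of SI-SNR and SI-SNRi (in dB), assuming 1-D numpy arrays for
# the estimate, the reference, and the input mixture.
import numpy as np

def si_snr(est: np.ndarray, ref: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SNR in dB between an estimate and a reference."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference (scale-invariant target).
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10 * np.log10((np.dot(s_target, s_target) + eps)
                         / (np.dot(e_noise, e_noise) + eps))

def si_snr_improvement(est, ref, mix) -> float:
    """SI-SNRi: gain over using the unprocessed mixture as the estimate."""
    return si_snr(est, ref) - si_snr(mix, ref)
```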
“…Prob-PIT [7] considers the probabilities of all utterance level permutations, rather than just the best one, improving the initial training stage when wrong alignments are likely to happen. A similar idea is employed by Yang et al [8], who trained a Conv-TasNet with uPIT and fixed alignments in turns, reporting 17.5 dB SI-SNRi. They also implemented Prob-PIT for Conv-TasNet and obtained 15.9 dB.…”
Section: Additional Discussion: Previous Results on WSJ0-2mix (mentioning)
confidence: 99%
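The Prob-PIT objective described in this statement replaces the hard minimum over permutations with a soft score over all of them. A hedged sketch of that idea in PyTorch follows; the tensor shapes, the per-pair MSE cost, and the temperature `gamma` are illustrative assumptions, not the cited papers' exact formulation.

```python
# Sketch of the Prob-PIT idea: instead of keeping only the single best
# permutation, soft-min over the costs of all permutations, so early in
# training every alignment contributes to the gradient.
import itertools
import torch

def prob_pit_loss(est: torch.Tensor, ref: torch.Tensor,
                  gamma: float = 1.0) -> torch.Tensor:
    """est, ref: (batch, n_src, time). Returns a scalar loss."""
    n_src = est.shape[1]
    perm_costs = []
    for perm in itertools.permutations(range(n_src)):
        # Mean-squared error under this speaker-to-output assignment.
        cost = ((est - ref[:, list(perm), :]) ** 2).mean(dim=(1, 2))
        perm_costs.append(cost)                     # each: (batch,)
    costs = torch.stack(perm_costs, dim=1)          # (batch, n_perms)
    # Soft-min via log-sum-exp over permutation costs.
    loss = -gamma * torch.logsumexp(-costs / gamma, dim=1)
    return loss.mean()
```

As `gamma` shrinks, the soft-min collapses to the hard PIT minimum, which is why the two objectives differ mainly in the early training stage the statement mentions.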
“…This is the technique used in the Permutation Invariant Training (PIT) speech separation method, which has been shown to be effective in addressing the permutation ambiguity [35]. However, it has been argued [39], [37] that the hard decision of choosing the minimum cost as the best solution results in training a sub-optimal separation model. More specifically, choosing the correct separation error is harder in the initial epochs of training, when the network is still naive and its outputs are not reliable.…”
Section: Problem Formulation (mentioning)
confidence: 99%
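The hard-decision PIT objective this statement refers to evaluates the training loss under every output-to-speaker permutation and keeps only the cheapest one per utterance. A minimal sketch, assuming (batch, n_src, time) tensors and an MSE pair cost:

```python
# Sketch of the hard-min PIT loss: compute the cost of every
# permutation and keep the minimum per utterance.
import itertools
import torch

def pit_loss(est: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    """est, ref: (batch, n_src, time). Returns the min-cost scalar loss."""
    n_src = est.shape[1]
    costs = torch.stack([
        ((est - ref[:, list(perm), :]) ** 2).mean(dim=(1, 2))
        for perm in itertools.permutations(range(n_src))
    ], dim=1)                          # (batch, n_perms)
    min_cost, _ = costs.min(dim=1)     # hard decision per utterance
    return min_cost.mean()
```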
“…Although PIT forces the frames belonging to the same speaker to be aligned with the same output stream, frames inside one utterance can still flip between different sources, leading to poor separation performance. Alternatively, the initial PIT-based separation model can be further trained with a fixed-label training strategy [3], or a long-term dependency can be imposed on the output streams by adding an additional speaker identity loss [4,5]. Another issue in blind source separation is that the speaker order of the separated signals during inference is also unknown, and needs to be identified by a speaker recognition system.…”
Section: Introduction (mentioning)
confidence: 99%
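The fixed-label strategy mentioned here can be sketched as a two-stage recipe: record the permutation the current PIT model prefers for each utterance, then continue training with that assignment frozen so frames can no longer flip between output streams. The model, optimizer, and MSE loss below are illustrative assumptions, not the exact procedure of [3].

```python
# Sketch of fixed-label training after an initial PIT stage.
import itertools
import torch

def best_permutation(est, ref):
    """Index of the min-cost permutation per utterance, plus the perm list."""
    n_src = est.shape[1]
    perms = list(itertools.permutations(range(n_src)))
    costs = torch.stack([
        ((est - ref[:, list(p), :]) ** 2).mean(dim=(1, 2)) for p in perms
    ], dim=1)                                   # (batch, n_perms)
    return costs.argmin(dim=1), perms

def fixed_label_step(model, opt, mix, ref, frozen_perm_idx, perms):
    """One optimization step with the stored (frozen) assignment."""
    est = model(mix)
    ref_aligned = torch.stack(
        [ref[b, list(perms[int(i)]), :]
         for b, i in enumerate(frozen_perm_idx)])
    loss = ((est - ref_aligned) ** 2).mean()    # no permutation search here
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```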