2021
DOI: 10.48550/arxiv.2112.09382
Preprint

Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem

Abstract: Deep learning based models have significantly improved the performance of speech separation with input mixtures like the cocktail party. Prominent methods (e.g., frequency-domain and time-domain speech separation) usually build regression models to predict the ground-truth speech from the mixture, using the masking-based design and the signal-level loss criterion (e.g., MSE or SI-SNR). This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem, with gre…
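As background for the signal-level criteria named in the abstract, here is a minimal PyTorch sketch of the SI-SNR objective that regression-based separators typically maximize. The function name, tensor shapes, and the eps constant are illustrative assumptions, not taken from the paper.

```python
import torch

def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB. est and ref are (batch, time) waveforms.

    Separation models trained with a signal-level loss usually minimize
    the negative of this quantity.
    """
    # Remove DC offset so the measure is invariant to constant shifts.
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to isolate the target component.
    dot = (est * ref).sum(dim=-1, keepdim=True)
    target = dot * ref / (ref.pow(2).sum(dim=-1, keepdim=True) + eps)
    noise = est - target
    return 10 * torch.log10(target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps))
```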

Cited by 5 publications (6 citation statements)
References 22 publications
“…Going one step further, [314] used a speech emotion conversion framework to modify the perceived emotion of a speech utterance while preserving its lexical content and speaker identity. Other studies have extended the idea of textless language processing or audio discrete representation to applications such as spoken question answering [316], speech separation [317], TTS [318], and speech-to-speech translation [319].…”
Section: B. No Text or Lexicon (mentioning)
confidence: 99%
“…complex Gaussian distribution for tractability. Based on the inverse problem in (11), the proposed method is able to generate refined signals with DDRM, which is summarized in Algorithm 1. We can interpret this to mean that the algorithm generates speech by appropriately combining the estimated clean speech $x^{(t)}_{k,l}$ with the observed noisy speech $y_{k,l}$ in accordance with the estimated noise variance at each time-frequency bin.…”
Section: Diffusion-Based Refiner for SE Outputs (mentioning)
confidence: 99%
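One hedged way to read the per-bin combination described in that statement is an inverse-variance weighting; the exact update is given by the cited paper's Algorithm 1 and may differ from this illustrative form:

$$\tilde{x}_{k,l} \;=\; (1-\lambda_{k,l})\, x^{(t)}_{k,l} \;+\; \lambda_{k,l}\, y_{k,l},
\qquad
\lambda_{k,l} \;=\; \frac{\sigma_t^2}{\sigma_t^2 + \hat{\sigma}^2_{k,l}},$$

where $\hat{\sigma}^2_{k,l}$ is the estimated noise variance at time-frequency bin $(k,l)$ and $\sigma_t^2$ the diffusion noise level at step $t$: a bin judged noisier leans on the estimate $x^{(t)}_{k,l}$, while a cleaner bin leans on the observation $y_{k,l}$.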
“…However, although PESQ and STOI are correlated with perceptual speech quality, optimizing the target SE model on the basis of these metrics does not always improve the actual quality perceived by humans, because the mechanisms of PESQ and STOI do not perfectly match human listening. Shi et al. and Liu et al. hypothesized that synthesizing conditioned speech would improve perceptual quality and proposed using a vocoder for the SE task to generate clean speech [11,12]. However, training the vocoder with noisy speech tends to be more laborious than training the SE model alone, and it often degrades the final perceptual quality.…”
Section: Introduction (mentioning)
confidence: 99%
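For reference, the intrusive metrics discussed in that statement can be computed with the third-party `pesq` and `pystoi` packages; the file names and the wideband-mode assumption below are placeholders for illustration, not taken from the cited work.

```python
# pip install pesq pystoi soundfile
import soundfile as sf
from pesq import pesq
from pystoi import stoi

ref, fs = sf.read("clean.wav")      # reference clean speech (placeholder path)
deg, _ = sf.read("enhanced.wav")    # enhanced/separated output (placeholder path)

# Both metrics compare aligned signals of equal length.
n = min(len(ref), len(deg))
ref, deg = ref[:n], deg[:n]

print("PESQ (wideband):", pesq(fs, ref, deg, "wb"))  # 'wb' mode expects fs = 16000
print("STOI:", stoi(ref, deg, fs, extended=False))
```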
“…We decided to exclude both U2U models as well as evaluation metrics from the core functionality of the library, as we believe these should be provided as example usage. There are plenty of ways to evaluate the overall pipeline (Lakhotia et al., 2021; Dunbar et al., 2019; Nguyen et al., 2020) as well as different ways to model the "pseudo-text" units (Shi et al., 2021; Kharitonov et al., 2021a; Polyak et al., 2021; Kreuk et al., 2021; Lee et al., 2021a), hence including them as an integral part of textless-lib would make the library overcomplicated and hard to use.…”
Section: Library Overview (mentioning)
confidence: 99%
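As a generic illustration of the "pseudo-text" unit modeling mentioned in that statement (and of the discretization idea in the title), frame-level self-supervised features can be quantized with k-means and consecutive repeats collapsed. This is not textless-lib's API; the feature source is stubbed out and the cluster count of 100 is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans

def features_to_units(feats: np.ndarray, kmeans: KMeans) -> list[int]:
    """feats: (frames, dim) features, e.g. from a HuBERT-style encoder."""
    ids = kmeans.predict(feats)
    # Collapse consecutive duplicates so the result reads like a token sequence.
    return [int(u) for i, u in enumerate(ids) if i == 0 or u != ids[i - 1]]

# Fit the codebook once on a pool of features, then tokenize utterances.
pool = np.random.randn(10_000, 768).astype(np.float32)  # stand-in for real features
kmeans = KMeans(n_clusters=100, n_init=10).fit(pool)
units = features_to_units(np.random.randn(250, 768).astype(np.float32), kmeans)
print(units[:20])
```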