ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020
DOI: 10.1109/icassp40776.2020.9053182
Joint Phoneme Alignment and Text-Informed Speech Separation on Highly Corrupted Speech

Abstract: HAL is a multidisciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Cited by 20 publications (19 citation statements)
References 19 publications (29 reference statements)
“…Recent studies [5,6,7] attempt to introduce phoneme information to a speech enhancement network. [5] proposes a phoneme-specific network for speech enhancement.…”
Section: Related Work (mentioning)
confidence: 99%
“…Recent studies [5,6,7] attempt to introduce phoneme information to a speech enhancement network. [5] … phoneme predictions will lead to severe degradation in enhanced speech.…”
Section: Related Work (mentioning)
confidence: 99%
“…The proposed neural forced alignment model learns the phone-to-audio alignment through the self-supervised task of reconstructing the quantized embeddings of the original speech from both heavily masked speech representations and phonemic information [12]. This could be implemented as the same pretraining task as Wav2Vec2 [8].…”
Section: Neural Forced Alignment (mentioning)
confidence: 99%
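The snippet above describes a Wav2Vec2-style objective: heavily mask the speech representations, condition on phonemic information, and predict the quantized targets at the masked frames. A minimal sketch of that setup follows; all module names, dimensions, and the GRU encoder are illustrative assumptions, not the architecture of the cited paper.

```python
import torch
import torch.nn as nn

class MaskedReconstructionModel(nn.Module):
    """Sketch of masked prediction of quantized speech targets,
    conditioned on phoneme labels (hypothetical sizes throughout)."""

    def __init__(self, feat_dim=64, n_phones=40, hidden=128, codebook_size=320):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, feat_dim)
        self.encoder = nn.GRU(feat_dim * 2, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, codebook_size)

    def forward(self, speech_feats, phone_ids, mask):
        # Zero out (mask) a large fraction of speech frames, then
        # concatenate phoneme embeddings as the conditioning signal.
        masked = speech_feats.masked_fill(mask.unsqueeze(-1), 0.0)
        x = torch.cat([masked, self.phone_emb(phone_ids)], dim=-1)
        ctx, _ = self.encoder(x)
        return self.proj(ctx)  # per-frame logits over quantized targets

B, T = 2, 50
model = MaskedReconstructionModel()
feats = torch.randn(B, T, 64)
phones = torch.randint(0, 40, (B, T))
mask = torch.rand(B, T) < 0.5          # "heavily masked" frames
logits = model(feats, phones, mask)

# Train only on masked positions, as in masked-prediction pretraining.
targets = torch.randint(0, 320, (B, T))  # stand-in quantized targets
loss = nn.functional.cross_entropy(logits[mask], targets[mask])
```

The key design point is that the loss is computed only where speech was masked, so the model must infer the hidden frames from context and the phonemic conditioning.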
“…One reason is that modern ASR has increasingly shifted towards end-to-end training using loss functions like CTC [9] that disregard precise frame alignment. Only a few works have explored using neural networks to perform segmentation of sentences [10] and phones [11,12,13]. These works demonstrate great potential for neural forced alignment, but they still require text transcriptions.…”
Section: Introduction (mentioning)
confidence: 99%
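The CTC loss mentioned in this snippet marginalizes over all valid frame-level alignments of the target phone sequence, which is why CTC-trained models never commit to explicit frame boundaries. A small illustration with PyTorch's built-in `torch.nn.CTCLoss` (all sizes arbitrary):

```python
import torch
import torch.nn as nn

T, B, C = 30, 1, 10  # frames, batch size, classes (index 0 = blank)

# Random per-frame log-probabilities standing in for acoustic model output.
log_probs = torch.randn(T, B, C).log_softmax(dim=-1)

# Target phone ids carry no timing information; CTC sums over every
# alignment consistent with this label sequence.
targets = torch.tensor([[3, 5, 5, 2]])
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([4])

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```

Because the alignment is summed out inside the loss, recovering frame boundaries from a CTC model requires a separate forced-alignment step, which motivates the neural alignment work cited above.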