ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020
DOI: 10.1109/icassp40776.2020.9053182
Joint Phoneme Alignment and Text-Informed Speech Separation on Highly Corrupted Speech

Abstract: HAL is a multidisciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Cited by 20 publications (19 citation statements)
References 19 publications (29 reference statements)
“…Recent studies [5,6,7] attempt to introduce phoneme information to a speech enhancement network. [5] proposes a phoneme-specific network for speech enhancement.…”
Section: Related Work (mentioning)
confidence: 99%
“…Recent studies [5,6,7] attempt to introduce phoneme information to a speech enhancement network. [5] … phoneme predictions will lead to severe degradation in enhanced speech.…”
Section: Related Work (mentioning)
confidence: 99%
“…The proposed neural forced alignment model learns the phone-to-audio alignment through the self-supervised task of reconstructing the quantized embeddings of the original speech from both heavily masked speech representations and phonemic information [12]. This could be implemented as the same pretraining task as Wav2Vec2 [8].…”
Section: Neural Forced Alignment (mentioning)
confidence: 99%
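The snippet above describes a Wav2Vec2-style objective: heavily mask the speech representations, condition on phonemic information, and predict the quantized targets at the masked frames. A minimal sketch of that setup follows; all module names, dimensions, and the GRU encoder are illustrative assumptions, not the architecture of the cited paper.

```python
import torch
import torch.nn as nn

class MaskedReconstructionModel(nn.Module):
    """Sketch of masked prediction of quantized speech targets,
    conditioned on phoneme labels (hypothetical sizes throughout)."""

    def __init__(self, feat_dim=64, n_phones=40, hidden=128, codebook_size=320):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, feat_dim)
        self.encoder = nn.GRU(feat_dim * 2, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, codebook_size)

    def forward(self, speech_feats, phone_ids, mask):
        # Zero out (mask) a large fraction of speech frames, then
        # concatenate phoneme embeddings as the conditioning signal.
        masked = speech_feats.masked_fill(mask.unsqueeze(-1), 0.0)
        x = torch.cat([masked, self.phone_emb(phone_ids)], dim=-1)
        ctx, _ = self.encoder(x)
        return self.proj(ctx)  # per-frame logits over quantized targets

B, T = 2, 50
model = MaskedReconstructionModel()
feats = torch.randn(B, T, 64)
phones = torch.randint(0, 40, (B, T))
mask = torch.rand(B, T) < 0.5          # "heavily masked" frames
logits = model(feats, phones, mask)

# Train only on masked positions, as in masked-prediction pretraining.
targets = torch.randint(0, 320, (B, T))  # stand-in quantized targets
loss = nn.functional.cross_entropy(logits[mask], targets[mask])
```

The key design point is that the loss is computed only where speech was masked, so the model must infer the hidden frames from context and the phonemic conditioning.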
“…One reason is that modern ASR has increasingly shifted towards end-to-end training using loss functions like CTC [9] that disregard precise frame alignment. Only a few works have explored using neural networks to perform segmentation of sentences [10] and phones [11,12,13]. These works demonstrate great potential for neural forced alignment, but they still require text transcriptions.…”
Section: Introduction (mentioning)
confidence: 99%
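The CTC loss mentioned in this snippet marginalizes over all valid frame-level alignments of the target phone sequence, which is why CTC-trained models never commit to explicit frame boundaries. A small illustration with PyTorch's built-in `torch.nn.CTCLoss` (all sizes arbitrary):

```python
import torch
import torch.nn as nn

T, B, C = 30, 1, 10  # frames, batch size, classes (index 0 = blank)

# Random per-frame log-probabilities standing in for acoustic model output.
log_probs = torch.randn(T, B, C).log_softmax(dim=-1)

# Target phone ids carry no timing information; CTC sums over every
# alignment consistent with this label sequence.
targets = torch.tensor([[3, 5, 5, 2]])
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([4])

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```

Because the alignment is summed out inside the loss, recovering frame boundaries from a CTC model requires a separate forced-alignment step, which motivates the neural alignment work cited above.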