Enhancement of esophageal speech obtained by a voice conversion technique using time dilated Fourier cepstra

Othmane, Imen Ben; Martino, Joseph Di; Ouni, Kaïs

doi:10.1007/s10772-018-09579-1

Cited by 7 publications

(6 citation statements)

References 42 publications


“…The time dilated Fourier Cepstra, as defined and detailed in [18], is used to enhance the esophageal speech in the frequency domain, by dilating the frequency axis of ratio 1/α. Thus, the frequency components will be changed, without corrupting the speech signal.…”

Section: The Time Dilated Fourier Cepstra Methods (mentioning)

confidence: 99%

Denoising Esophageal Speech using Combination of Complex and Discrete Wavelet Transform with Wiener filter and Time Dilated Fourier Cepstra

et al. 2022

Self Cite


Esophageal speech is one of the pathological voices, which is known to be weak in intelligibility and hard to understand. Our approach's main idea is to reduce the esophageal speech noises using two-hybrid methods. This paper aims to merge the advantages of wavelet-based methods such as DWT and DTCWT, along with the standard methods such as the Wiener filter and the time dilated Fourier. The first hybrid method applies the filters on the vocal tract cepstrum, while the second one applies them at the synthesis stage. Two experiments were conducted as well to evaluate the results by objective analysis. The results obtained by the proposed hybrid methods gave good performances.
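The citation statement above describes dilating the frequency axis by a ratio of 1/α. As a rough illustration only, and not the cited authors' implementation, the sketch below resamples a sampled magnitude spectrum by linear interpolation so that a feature at input bin k moves to bin k/α. The function name `dilate_frequency_axis`, the zero-fill at the upper edge, and the choice of linear interpolation are all assumptions made for this example.

```python
def dilate_frequency_axis(spectrum, alpha):
    """Scale the frequency axis of a sampled spectrum by 1/alpha (illustrative sketch).

    Output bin k takes the linearly interpolated input value at position k * alpha,
    so a feature located at input bin k ends up at output bin k / alpha.
    Output bins that would read past the last input bin are set to 0.0 (an assumption).
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    n = len(spectrum)
    out = []
    for k in range(n):
        pos = k * alpha          # fractional position in the input spectrum
        lo = int(pos)            # index of the input bin just below `pos`
        if lo >= n - 1:
            # Exactly on the last bin: copy it; beyond it: no data, fill with zero.
            out.append(float(spectrum[-1]) if pos == n - 1 else 0.0)
            continue
        frac = pos - lo
        out.append((1.0 - frac) * spectrum[lo] + frac * spectrum[lo + 1])
    return out


# Hypothetical toy spectrum with a single peak at bin 2.
toy_spectrum = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
stretched = dilate_frequency_axis(toy_spectrum, 0.5)   # peak moves to bin 2 / 0.5 = 4
```

With α < 1 the spectrum is stretched toward higher bins, and with α > 1 it is compressed toward lower bins. The overall shape is preserved in both cases.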



“…The source (x_n) and target (y_n) vectors previously aligned by the DTW algorithm are concatenated together into an extended vector z_n = [x_n, y_n] and then the GMM parameters that model the joint probability density are estimated. • DNN: the DNN-based VC system was implemented based on the approach of [10].…”

Section: Experimental Setups (mentioning)

confidence: 99%
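The joint-vector construction quoted above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the cited system: the DTW path is taken as given, the helper names `build_joint_vectors` and `fit_joint_gaussian` are hypothetical, and the density is fitted with a single Gaussian (the one-component special case of a GMM) rather than a full mixture trained with EM.

```python
def build_joint_vectors(source, target, path):
    """Concatenate DTW-aligned frames into joint vectors z_n = [x_n, y_n].

    `path` is a list of (i, j) index pairs, assumed to come from a DTW alignment.
    """
    return [list(source[i]) + list(target[j]) for i, j in path]


def fit_joint_gaussian(joint):
    """Maximum-likelihood mean and covariance of the joint vectors.

    This is only the one-component case; a real GMM would estimate several
    weighted components, typically with the EM algorithm.
    """
    n = len(joint)
    d = len(joint[0])
    mean = [sum(z[k] for z in joint) / n for k in range(d)]
    cov = [
        [sum((z[a] - mean[a]) * (z[b] - mean[b]) for z in joint) / n for b in range(d)]
        for a in range(d)
    ]
    return mean, cov


# Hypothetical one-dimensional features and an identity alignment path.
source_frames = [[1.0], [2.0], [3.0]]
target_frames = [[2.0], [4.0], [6.0]]
dtw_path = [(0, 0), (1, 1), (2, 2)]

joint = build_joint_vectors(source_frames, target_frames, dtw_path)
mean, cov = fit_joint_gaussian(joint)
```

The cross-covariance block of `cov` captures how source and target features vary together. A joint-density conversion function relies on this block to predict target features from source features.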

“…Due to the extensive use of the esophageal voice by laryngectomees, this type of voice has been the subject of numerous studies in the last few years. To our knowledge, the existing approaches for ES quality improvements can be summarized into three categories: approaches based on the transformation of acoustic features, such as formant synthesis [4], comb filtering [5], and smoothing of acoustic parameters [6]; approaches based on statistical techniques, where [7][8][9] have been carried out, and approaches based on the VC technique, which allows for the transformation of the voice of a source speaker (laryngectomee) into that of a target speaker (laryngeal) [10][11][12][13][14][15][16]. Although these approaches have of course improved the estimation of the acoustic characteristics to reconstruct a converted signal with better quality, the improvements in intelligibility and naturalness are still insufficient.…”

Section: Introduction (mentioning)

confidence: 99%

Intelligibility Improvement of Esophageal Speech Using Sequence-to-Sequence Voice Conversion with Auditory Attention

2022

Self Cite


Laryngectomees are individuals whose larynx has been surgically removed, usually due to laryngeal cancer. The immediate consequence of this operation is that these individuals (laryngectomees) are unable to speak. Esophageal speech (ES) remains the preferred alternative speaking method for laryngectomees. However, compared to the laryngeal voice, ES is characterized by low intelligibility and poor quality due to chaotic fundamental frequency F0, specific noises, and low intensity. Our proposal to solve these problems is to take advantage of voice conversion as an effective way to improve speech quality and intelligibility. To this end, we propose in this work a novel esophageal–laryngeal voice conversion (VC) system based on a sequence-to-sequence (Seq2Seq) model combined with an auditory attention mechanism. The originality of the proposed framework is that it adopts an auditory attention technique in our model, which leads to more efficient and adaptive feature mapping. In addition, our VC system does not require the classical DTW alignment process during the learning phase, which avoids erroneous mappings and significantly reduces the computational time. Moreover, to preserve the identity of the target speaker, the excitation and phase coefficients are estimated by querying a binary search tree. In experiments, objective and subjective tests confirmed that the proposed approach performs better even in some difficult cases in terms of speech quality and intelligibility.


“…This conversion function can then be used to convert new OS samples, thereby getting OS speech that has characteristics of HS. In recent times, Deep Neural Networks (DNN) are more popular and effective compared to GMM based methods for enhancement of alaryngeal speech [20][21][22][23] and other types of pathological speech [24,25]. Another attempt to enrich OS was by using the eigenvoices concept [26], which was inspired by the eigenfaces concept [27].…”

Section: Introduction (mentioning)

confidence: 99%

Enrichment of Oesophageal Speech: Voice Conversion with Duration–Matched Synthetic Speech as Target

et al. 2021


Pathological speech such as Oesophageal Speech (OS) is difficult to understand due to the presence of undesired artefacts and lack of normal healthy speech characteristics. Modern speech technologies and machine learning enable us to transform pathological speech to improve intelligibility and quality. We have used a neural network based voice conversion method with the aim of improving the intelligibility and reducing the listening effort (LE) of four OS speakers of varying speaking proficiency. The novelty of this method is the use of synthetic speech matched in duration with the source OS as the target, instead of parallel aligned healthy speech. We evaluated the converted samples from this system using a collection of Automatic Speech Recognition systems (ASR), an objective intelligibility metric (STOI) and a subjective test. ASR evaluation shows that the proposed system had significantly better word recognition accuracy compared to unprocessed OS, and baseline systems which used aligned healthy speech as the target. There was an improvement of at least 15% on STOI scores indicating a higher intelligibility for the proposed system compared to unprocessed OS, and a higher target similarity in the proposed system compared to baseline systems. The subjective test reveals a significant preference for the proposed system compared to unprocessed OS for all OS speakers, except one who was the least proficient OS speaker in the data set.
