Progressive loss functions for speech enhancement with deep neural networks

Llombart, Jorge; Ribas, Dayana; Miguel, Antonio; Vicente, Luís; Giménez, Alfonso Ortega; Lleida, Eduardo

doi:10.1186/s13636-020-00191-3

Cited by 18 publications

(4 citation statements)

References 25 publications


“…Our model is a fully-convolutional denoising autoencoder with skip connections (Figure 1), in the style of previous effective SE models [27,30,34]. In training, we input a noisy audio waveform x ∈ ℝ^T, comprised of clean speech signal y ∈ ℝ^T and background noise n ∈ ℝ^T so that x = λy + (1 − λ)n, where λ is a parameter to control the SNR.…”

Section: Model (mentioning)

confidence: 99%

TASE: Task-Aware Speech Enhancement for Wake-Up Word Detection in Voice Assistants

et al. 2022


Wake-up word spotting in noisy environments is a critical task for an excellent user experience with voice assistants. Unwanted activation of the device is often due to the presence of noises coming from background conversations, TVs, or other domestic appliances. In this work, we propose the use of a speech enhancement convolutional autoencoder, coupled with on-device keyword spotting, aimed at improving the trigger word detection in noisy environments. The end-to-end system learns by optimizing a linear combination of losses: a reconstruction-based loss, both at the log-mel spectrogram and at the waveform level, as well as a specific task loss that accounts for the cross-entropy error reported along the keyword spotting detection. We experiment with several neural network classifiers and report that deeply coupling the speech enhancement together with a wake-up word detector, e.g., by jointly training them, significantly improves the performance in the noisiest conditions. Additionally, we introduce a new publicly available speech database recorded for the Telefónica’s voice assistant, Aura. The OK Aura Wake-up Word Dataset incorporates rich metadata, such as speaker demographics or room conditions, and comprises hard negative examples that were studiously selected to present different levels of phonetic similarity with respect to the trigger words “OK Aura”.
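The mixing rule quoted in the citation statement above, x = λy + (1 − λ)n, can be illustrated with a short sketch. It is not taken from the cited paper: the helper names (`power`, `lambda_for_snr`, `mix`, `mix_snr_db`) are hypothetical, and the relation between λ and the SNR is derived here under the assumption that SNR is the ratio of the scaled speech power to the scaled noise power, in dB.

```python
import math


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def lambda_for_snr(clean, noise, snr_db):
    """Return the lambda that gives the requested SNR (assumed definition).

    With x = lam * y + (1 - lam) * n, the assumed SNR is
    10 * log10(lam^2 * P_y / ((1 - lam)^2 * P_n)).
    Solving for lam gives lam = r / (1 + r),
    where r = sqrt(P_n / P_y) * 10^(snr_db / 20).
    """
    ratio = math.sqrt(power(noise) / power(clean)) * 10 ** (snr_db / 20)
    return ratio / (1 + ratio)


def mix(clean, noise, lam):
    """Build the noisy waveform x = lam * y + (1 - lam) * n sample by sample."""
    return [lam * y + (1 - lam) * n for y, n in zip(clean, noise)]


def mix_snr_db(clean, noise, lam):
    """SNR in dB of the mixture for a given lambda (same assumed definition)."""
    speech_power = lam ** 2 * power(clean)
    noise_power = (1 - lam) ** 2 * power(noise)
    return 10 * math.log10(speech_power / noise_power)


# Toy signals, for illustration only.
clean = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
lam = lambda_for_snr(clean, noise, snr_db=10.0)
noisy = mix(clean, noise, lam)
```

Under this definition, λ close to 1 leaves mostly clean speech and λ close to 0 leaves mostly noise, which is consistent with λ acting as an SNR control.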



“…Llombart et al [28] developed the progressive SE using convolutional and residual neural network structures. In this system, 2 different conditions were used for optimizing the loss factor such as weighted and homogeneous progressive.…”

Section: Literature Survey (mentioning)

confidence: 99%

Ideal Ratio Mask Estimation using Supervised DNN Approach for Target Speech Signal Enhancement

Selvaraj, Chandra 2021

Preprint


The most challenging process in recent Speech Enhancement (SE) systems is to exclude the non-stationary noises and additive white Gaussian noise in real-time applications. Several SE techniques suggested were not successful in real-time scenarios to eliminate noises in the speech signals due to the high utilization of resources. So, a Sliding Window Empirical Mode Decomposition including a Variant of Variational Model Decomposition and Hurst (SWEMD-VVMDH) technique was developed for minimizing the difficulty in real-time applications. But, this is the statistical framework that takes a long time for computations. Hence in this article, this SWEMD-VVMDH technique is extended using Deep Neural Network (DNN) that learns the decomposed speech signals via SWEMD-VVMDH efficiently to achieve SE. At first, the noisy speech signals are decomposed into Intrinsic Mode Functions (IMFs) by the SWEMD Hurst (SWEMDH) technique. Then, the Time-Delay Estimation (TDE)-based VVMD was performed on the IMFs to elect the most relevant IMFs according to the Hurst exponent and lessen the low- as well as high-frequency noise elements in the speech signal. For each signal frame, the target features are chosen and fed to the DNN that learns these features to estimate the Ideal Ratio Mask (IRM) in a supervised manner. The abilities of DNN are enhanced for the categories of background noise, and the Signal-to-Noise Ratio (SNR) of the speech signals. Also, the noise category dimension and the SNR dimension are chosen for training and testing manifold DNNs since these are dimensions often taken into account for the SE systems. Further, the IRM in each frequency channel for all noisy signal samples is concatenated to reconstruct the noiseless speech signal. At last, the experimental outcomes exhibit considerable improvement in SE under different categories of noises.


“…Nowadays, Deep learning systems have achieved human-compatible success in predicting the labels in almost all domains. Many artificial intelligent subdomain techniques are applied in variety of appliances from Malware Detection (Kumar, 2020), Object Recognition (Bayraktar, 2019), Image Classification (Ahuja, 2020) (Rajagopal, 2020), Speech Recognition (Llombart, 2021), Natural Language Processing (Do, 2021), Medical Science (Esteva, 2017), Satellite Applications (Kumar, 2020), to Facial Recognition systems (Menon, 2021). With the growing adoption of deep neural networks by many companies, DNN the use of DNN in safety-critical environment applications including, Drones, Robotics, Voice Recognition, Self-driving cars like Uber, Apple & Samsung, Tesla (Lex, 2019), Surveillance systems (Pillai, 2021), Apple Siri ("Apple," 2019), Amazon Alexa (2019), Etc.…”

Section: Introduction (mentioning)

confidence: 99%

Generation of Adversarial Mechanisms in Deep Neural Networks

Pavate, Bansode 2022

International Journal of Ambient Computing and Intelligence


Deep learning is a subspace of intelligence system learning that experienced prominent results in almost all the application domains. However, Deep Neural Network found to be susceptible to perturbed inputs such that the model generates output other than the expected one. By including insignificant perturbation to the input effectuate computer vision models to make an erroneous prediction. Though, it is still a dilemma whether humans are prone to comparable errors. In this paper, we focus on this issue by leveraging the latest practices that help to generate adversarial examples in computer vision applications by considering diverse identified parameters, unidentified parameters, and architectures. The analysis of the distinct techniques has been done by considering different common parameters. Adversarial examples are easily transferable while designing computer vision applications that control the condition of the classifications of labels. The finding highlights that some methods like Zoo and Deepfool achieved 100% success for the nontargeted attack but are application-specific.

