Biometric systems are exposed to spoofing attacks which can compromise their security, and voice biometrics based on automatic speaker verification (ASV) is no exception. To increase robustness against such attacks, anti-spoofing systems have been proposed for the detection of replay, synthesis and voice conversion-based attacks. However, most anti-spoofing techniques proposed so far are only loosely integrated with the ASV system. In this work, we develop a new integration neural network which jointly processes the embeddings extracted from the ASV and anti-spoofing systems in order to detect both zero-effort impostors and spoofing attacks. Moreover, we propose a new loss function based on the minimization of the area under the expected performance and spoofability curve (EPSC), which allows us to optimize the integration neural network on the operating range in which the biometric system is expected to work. To evaluate our proposals, experiments were carried out on the recent ASVspoof 2019 corpus, including both the logical access (LA) and physical access (PA) scenarios. The experimental results show that our proposal clearly outperforms some well-known techniques based on integration at the score and embedding levels. Specifically, our proposal achieves up to 23.62% and 22.03% relative equal error rate (EER) improvement over the best performing baseline in the LA and PA scenarios, respectively, as well as relative gains of 27.62% and 29.15% on the area under the EPSC (AUE) metric.
Index Terms: Automatic speaker verification (ASV), spoofing detection, embeddings, integration of ASV and anti-spoofing, expected performance and spoofability curve (EPSC).
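As a rough illustration of the integration idea, the following is a minimal sketch (not the authors' implementation) of a network that jointly scores an ASV embedding and an anti-spoofing embedding. Embedding dimensions, layer sizes and the training loss are assumptions; the paper optimizes an AUE-of-EPSC loss, which is only stubbed here with a standard binary cross-entropy.

```python
import torch
import torch.nn as nn

class IntegrationNet(nn.Module):
    """Small MLP that fuses ASV and anti-spoofing embeddings into one score."""
    def __init__(self, asv_dim=512, spoof_dim=256, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(asv_dim + spoof_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # single score: target vs. (zero-effort impostor or spoof)
        )

    def forward(self, asv_emb, spoof_emb):
        return self.net(torch.cat([asv_emb, spoof_emb], dim=-1)).squeeze(-1)

model = IntegrationNet()
asv_emb = torch.randn(8, 512)               # embeddings from the ASV system (assumed dim)
spoof_emb = torch.randn(8, 256)             # embeddings from the anti-spoofing system (assumed dim)
labels = torch.randint(0, 2, (8,)).float()  # 1 = target, 0 = impostor or spoofing attack
scores = model(asv_emb, spoof_emb)
# Stand-in training criterion; the proposed method instead minimizes the area
# under the EPSC over the desired operating range.
loss = nn.functional.binary_cross_entropy_with_logits(scores, labels)
loss.backward()
```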
During depression, neurophysiological changes can occur that may affect laryngeal control, i.e. the behaviour of the vocal folds. Characterising these changes precisely from speech signals is a non-trivial task, as it typically requires reliable separation of the voice source information from the speech signal. In this paper, by exploiting the ability of CNNs to learn task-relevant information from raw input signals, we investigate several methods to model voice source related information for depression detection. Specifically, we investigate modelling of low-pass filtered speech signals, linear prediction residual signals, homomorphically filtered voice source signals and zero frequency filtered signals to learn voice source related information for depression detection. Our investigations show that subsegmental-level modelling of linear prediction residual signals or zero frequency filtered signals yields systems that outperform both state-of-the-art systems based on low-level descriptors and deep learning based systems that model vocal tract system information.
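For reference, this is a minimal sketch (assumed, not the authors' code) of two of the voice source signals mentioned above: the linear prediction (LP) residual and the zero frequency filtered (ZFF) signal. The LP order, trend-removal window and input file are illustrative assumptions.

```python
import numpy as np
import scipy.signal as sig
import librosa

def lp_residual(speech, order=16):
    """LP residual: pass the signal through the inverse filter A(z)."""
    a = librosa.lpc(speech, order=order)
    return sig.lfilter(a, [1.0], speech)

def zero_frequency_filter(speech, fs, trend_win_ms=10.0):
    """ZFF signal: difference the signal, pass it through two cascaded
    resonators at 0 Hz, then remove the slowly varying trend."""
    x = np.diff(speech, prepend=speech[0])
    for _ in range(2):                            # two cascaded 0-Hz resonators
        x = sig.lfilter([1.0], [1.0, -2.0, 1.0], x)
    win = int(trend_win_ms * 1e-3 * fs) | 1       # odd-length averaging window
    trend = np.convolve(x, np.ones(win) / win, mode="same")
    return x - trend

speech, fs = librosa.load("utterance.wav", sr=16000)  # hypothetical input file
res = lp_residual(speech)
zff = zero_frequency_filter(speech, fs)
```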
This paper addresses the Styrian Dialect sub-challenge of the INTERSPEECH 2019 Computational Paralinguistics Challenge (ComParE). We treat this challenge as dialect identification with no linguistic resources or knowledge and with limited acoustic resources, and develop end-to-end raw waveform modelling methods that incorporate knowledge related to speech production. In this direction, we investigate two methods: (a) modelling the signals after source-system decomposition and (b) transferring knowledge from articulatory feature models trained on English. Our investigations on the ComParE 2019 Styrian dialect data show that the proposed approaches yield systems that perform better than the low-level-descriptor-based and bag-of-audio-words-representation-based approaches, and comparably to the sequence-to-sequence autoencoder-based approach.
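The following is a minimal sketch (assumptions throughout) of the transfer idea in method (b): reuse the convolutional front-end of a raw-waveform articulatory feature classifier trained on English and attach a new head for dialect classification. The architecture, the checkpoint name, the freezing strategy and the number of dialect classes are all illustrative.

```python
import torch
import torch.nn as nn

class RawWaveformEncoder(nn.Module):
    """Learned filterbank-style front-end operating on raw samples."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=300, stride=100), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.proj = nn.Linear(128, emb_dim)

    def forward(self, wav):                      # wav: (batch, samples)
        h = self.conv(wav.unsqueeze(1)).squeeze(-1)
        return self.proj(h)

encoder = RawWaveformEncoder()
# encoder.load_state_dict(torch.load("articulatory_encoder.pt"))  # hypothetical checkpoint
for p in encoder.parameters():                   # keep the transferred front-end fixed
    p.requires_grad = False

dialect_head = nn.Linear(128, 3)                 # number of dialect classes (assumed)
logits = dialect_head(encoder(torch.randn(4, 16000)))  # four one-second utterances
```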
Speech-based estimation of the degree of sleepiness is an emerging research problem. In the literature, this problem has mainly been addressed through modeling of low-level descriptors. This paper investigates an end-to-end approach in which, given the raw waveform as input, a neural network estimates the degree of sleepiness at its output. Through an investigation on the continuous sleepiness sub-challenge of the INTERSPEECH 2019 Computational Paralinguistics Challenge, we show that the proposed approach consistently yields performance comparable to or better than regression systems based on low-level descriptors, bag-of-audio-words representations and sequence-to-sequence autoencoder feature representations. Furthermore, a confusion matrix analysis on the development set shows that, unlike the best baseline system, the performance of our approach is not concentrated around a few degrees of sleepiness, but is spread across all degrees of sleepiness.
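As an illustration only (not the system described above), a minimal end-to-end regressor could map a raw waveform segment to a single sleepiness value. The architecture, segment length and target scale are assumptions.

```python
import torch
import torch.nn as nn

class SleepinessRegressor(nn.Module):
    """Small raw-waveform CNN producing one continuous sleepiness estimate."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=400, stride=160), nn.ReLU(),  # ~25 ms windows, 10 ms shift at 16 kHz
            nn.Conv1d(32, 64, kernel_size=7, stride=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.regressor = nn.Linear(64, 1)

    def forward(self, wav):                      # wav: (batch, samples)
        h = self.features(wav.unsqueeze(1)).squeeze(-1)
        return self.regressor(h).squeeze(-1)     # continuous degree of sleepiness

model = SleepinessRegressor()
wav = torch.randn(8, 4 * 16000)                  # eight 4-second segments (assumed length)
target = torch.randint(1, 10, (8,)).float()      # e.g. Karolinska Sleepiness Scale labels
loss = nn.functional.mse_loss(model(wav), target)
loss.backward()
```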
Children's speech recognition based on short-term spectral features is a challenging task. One of the reasons is that children's speech has a high fundamental frequency, comparable to formant frequency values. Furthermore, as children grow, their vocal apparatus also undergoes changes. This makes it difficult to reliably extract standard short-term spectral features for speech recognition. In recent years, novel acoustic modeling methods have emerged that learn both the features and the phone classifier in an end-to-end manner from the raw speech signal. Through an investigation on the PF-STAR corpus, we show that children's speech recognition can be improved using end-to-end acoustic modeling methods.
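For concreteness, here is a minimal sketch (assumptions, not the exact system) of such end-to-end acoustic modeling: features and the phone classifier are learned jointly from a window of raw samples, producing per-frame phone posteriors for a hybrid HMM setup. Window size, layer sizes and the phone inventory are illustrative.

```python
import torch
import torch.nn as nn

n_phones = 40                                    # assumed phone inventory size

acoustic_model = nn.Sequential(
    nn.Conv1d(1, 80, kernel_size=30, stride=10), nn.ReLU(),  # learned "filterbank" stage
    nn.Conv1d(80, 60, kernel_size=7, stride=5), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(60, 512), nn.ReLU(),
    nn.Linear(512, n_phones),                    # per-frame phone scores
)

frames = torch.randn(16, 1, 4000)                # 16 windows of 250 ms raw speech at 16 kHz
phone_labels = torch.randint(0, n_phones, (16,)) # aligned frame-level phone targets
loss = nn.functional.cross_entropy(acoustic_model(frames), phone_labels)
loss.backward()
```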