Many acoustic parameters are used for pathological voice assessment and serve clinicians as tools to distinguish between normophonic and pathological voices. However, many of these parameters require appropriate tuning in order to maximize their efficiency. In this work, a group of new and previously proposed modulation spectrum (MS) metrics is optimized over different time and frequency ranges, with the aim of maximizing efficiency in the detection of pathological voices. The optimization of the metrics is performed simultaneously on two different voice databases in order to identify which tuning ranges generalize better. The experiments were cross-validated to ensure the validity of the results. A third database is used to test the optimized metrics. Despite some differences, the results indicate that the behavior of the metrics in the optimization process follows similar tendencies for the tuning databases, confirming the generalization capabilities of the proposed MS metrics. In addition, the tuning process reveals which bands of the modulation spectra carry relevant information for each metric, which has a physical interpretation with respect to the phonatory system. Efficiency values up to 90.6% are obtained in one tuning database, while in the other the maximum efficiency reaches 71.1%. The results also show separability between normophonic and pathological states using the proposed metrics, which can be exploited for voice pathology detection or assessment.
Only a handful of the world's languages have the abundant resources that enable practical applications of speech processing technologies. One way to overcome this problem is to use the resources existing in other languages to train a multilingual automatic speech recognition (ASR) model, which, intuitively, should learn some universal phonetic representations. In this work, we focus on gaining a deeper understanding of how general these representations might be, and how the recognition of individual phones improves in a multilingual setting. To that end, we select a phonetically diverse set of languages and perform a series of monolingual, multilingual and crosslingual (zero-shot) experiments. The ASR model is trained to recognize International Phonetic Alphabet (IPA) token sequences. We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting, where the model, among other errors, treats Javanese as a tone language. Notably, as little as 10 hours of target-language training data tremendously reduces ASR error rates. Our analysis shows that even phones that are unique to a single language can benefit greatly from adding training data from other languages, an encouraging result for the low-resource speech community.
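As an illustration of how error rates over IPA token sequences can be measured, the following is a minimal sketch of a phone error rate based on Levenshtein distance. The abstract does not state which exact metric or tokenization the authors used, so the function names `edit_distance` and `phone_error_rate` and the toy sequences are assumptions made for this example only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (lists of IPA tokens)."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phone_error_rate(refs, hyps):
    """Total edit errors divided by the total number of reference phones."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_phones = sum(len(r) for r in refs)
    return total_errors / total_phones


# Hypothetical toy data: two reference utterances and two model outputs.
refs = [["k", "a", "t"], ["s", "u"]]
hyps = [["k", "a", "d"], ["s"]]
per = phone_error_rate(refs, hyps)
```

Here one substitution and one deletion over five reference phones give a rate of 2/5. Computing the same quantity per phone, rather than per corpus, is one way the per-phone improvements described above could be inspected.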
Automatic detection of phoneme or word-like units is one of the core objectives in zero-resource speech processing. Recent attempts employ self-supervised training methods, such as contrastive predictive coding (CPC), where the next frame is predicted given past context. However, CPC only looks at the audio signal's frame-level structure. We overcome this limitation with a segmental contrastive predictive coding (SCPC) framework that can model the signal structure at a higher level, e.g., at the phoneme level. In this framework, a convolutional neural network learns frame-level representations from the raw waveform via noise-contrastive estimation (NCE). A differentiable boundary detector finds variable-length segments, which are then used to optimize a segment encoder via NCE to learn segment representations. The differentiable boundary detector allows us to train frame-level and segment-level encoders jointly. Typically, phoneme and word segmentation are treated as separate tasks. We unify them and experimentally show that our single model outperforms existing phoneme and word segmentation methods on the TIMIT and Buckeye datasets. We analyze the impact of the boundary threshold and the right time to include the segmental loss in the learning process.
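The contrastive objective mentioned above can be sketched as an InfoNCE-style loss: the model should score the true next representation higher than a set of distractors. This is a minimal sketch, not the authors' implementation. The cosine scoring function, the `temperature` parameter, and the function name `info_nce_loss` are assumptions, since the abstract does not specify them.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def info_nce_loss(context, positive, negatives, temperature=0.1):
    """Negative log-probability of the positive among positive + negatives.

    The score function (cosine / temperature) is an assumption for this sketch.
    """
    scores = [cosine(context, positive) / temperature]
    scores += [cosine(context, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(scores)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - scores[0]


# Hypothetical 2-D representations: the positive matches the context,
# the negatives point elsewhere, so the loss should be close to zero.
loss_good = info_nce_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# Here the "positive" is orthogonal to the context and a negative matches it,
# so the loss should be large.
loss_bad = info_nce_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

The same loss form can be applied at both levels described in the abstract, with frame representations in one case and segment representations in the other. How segments are pooled from frames is left out of this sketch.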
Disordered voices are frequently assessed by speech pathologists using perceptual evaluations. This can lead to problems caused by the subjective nature of the process and by the influence of external factors that compromise the quality of the assessment. In order to increase the reliability of the evaluations, the design of automatic evaluation systems is desirable. With that in mind, this paper presents an automatic system that assesses the Grade and Roughness levels of speech according to the GRBAS perceptual scale. Two parameterization methods are used: one based on the classic Mel-Frequency Cepstral Coefficients (MFCCs), which has already been used successfully in previous works, and another derived from modulation spectra. For the latter, a new group of parameters is proposed, named Modulation Spectra Morphological Parameters: MSC, DRB, LMR, MSH, MSW, CIL, PALA, and RALA. In the methodology, PCA and LDA are employed to reduce the dimensionality of the feature space, and GMM classifiers are used to evaluate the ability of the proposed features to distinguish the different levels. Efficiencies of 81.6% and 84.7% are obtained for Grade and Roughness, respectively, using modulation spectra parameters, while MFCCs achieved 80.5% and 77.7%. The results suggest the usefulness of the proposed Modulation Spectra Morphological Parameters for automatic evaluation of Grade and Roughness in speech.
Although a large number of acoustic indicators have already been proposed in the literature to evaluate the hypokinetic dysarthria of people with Parkinson’s Disease, the goal of this work is to identify and interpret new, reliable and complementary articulatory biomarkers that could be applied to predict or evaluate Parkinson’s Disease from a diadochokinetic test, contributing to the possibility of a further multidimensional analysis of the speech of parkinsonian patients. The proposed biomarkers are based on the kinetic behaviour of the envelope trace, which is directly linked with the articulatory dysfunctions introduced by the disease from its early stages. The interest of these new articulatory indicators lies in their ease of identification and interpretation, and in their potential to be translated into computer-based automatic methods to screen for the disease from speech. Throughout this paper, the accuracy provided by these acoustic kinetic biomarkers is compared with that obtained with a baseline system based on speaker identification techniques. Results show accuracies around 85%, in line with those obtained with complex state-of-the-art speaker recognition techniques, but with an easier physical interpretation, which opens the possibility of transfer to a clinical setting.