Highlights

- New software paradigm for linguistic/phonetic tools: webservices
- Webservices encapsulating basic processing tools
- Webservices as building blocks for complex systems
- Web interface as front end to webservices or systems of webservices
- BAS CLARIN webservices: a free service to the scientific community
- Multilingual automatic segmentation and labelling of speech into words and phones
- Multilingual automatic text-to-phoneme conversion webservice
- Multilingual syllabification webservice
- Free German speech synthesis webservice

The services include automatic segmentation of speech, grapheme-to-phoneme conversion, syllabification, speech synthesis, and optimal symbol sequence alignment.
We examine prosodic entrainment in cooperative game dialogs using new feature sets describing register, pitch accent shape, and rhythmic aspects of utterances. For these as well as for established features, we present entrainment profiles to detect within- and across-dialog entrainment by the speakers' gender and role in the game. It turned out that the feature sets undergo entrainment in different quantitative and qualitative ways, which can partly be attributed to their different functions. Furthermore, interactions between speaker gender and role (describer vs. follower) suggest gender-dependent strategies in cooperative, solution-oriented interactions: female describers entrain most, male describers least. Our data suggest a slight advantage of the latter strategy for task success.
In this contribution, we investigate the effectiveness of deep fusion of text and audio features for categorical and dimensional speech emotion recognition (SER). We propose a novel, multistage fusion method where the two information streams are integrated in several layers of a deep neural network (DNN), and contrast it with a single-stage one where the streams are merged in a single point. Both methods depend on extracting summary linguistic embeddings from a pre-trained BERT model, and conditioning one or more intermediate representations of a convolutional model operating on log-Mel spectrograms. Experiments on the widely used IEMOCAP and MSP-Podcast databases demonstrate that the two fusion methods clearly outperform a shallow (late) fusion baseline and their unimodal constituents, both in terms of quantitative performance and qualitative behaviour. Our accompanying analysis further reveals a hitherto unexplored role of the underlying dialogue acts on unimodal and bimodal SER, with different models showing a biased behaviour across different acts. Overall, our multistage fusion shows better quantitative performance, surpassing all alternatives on most of our evaluations. This illustrates the potential of multistage fusion in better assimilating text and audio information.
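The difference between the two fusion schemes can be shown structurally. The following is a toy, framework-free sketch in plain Python: layer widths, the number of merge points, and the random stand-in weights are illustrative assumptions, not the authors' architecture. The only point it makes is where the text embedding joins the audio stream: once before the output (single-stage) versus re-injected at every layer (multistage).

```python
import random

rng = random.Random(0)  # fixed seed: untrained stand-in weights


def relu(vec):
    return [max(0.0, x) for x in vec]


def dense(vec, out_dim):
    """Random projection standing in for a trained dense layer."""
    return [sum(rng.uniform(-0.1, 0.1) * v for v in vec)
            for _ in range(out_dim)]


def single_stage_fusion(audio, text, n_layers=3, hidden=8, n_classes=4):
    """Audio is processed alone; the text embedding is merged at a single
    point, just before the output layer."""
    h = audio
    for _ in range(n_layers):
        h = relu(dense(h, hidden))
    return dense(h + text, n_classes)


def multistage_fusion(audio, text, n_layers=3, hidden=8, n_classes=4):
    """The text embedding is concatenated onto the audio representation at
    every layer -- the several-merge-points idea, in skeletal form."""
    h = audio
    for _ in range(n_layers):
        h = relu(dense(h + text, hidden))
    return dense(h + text, n_classes)
```

In a real model the projections would be trained jointly, the audio path would be a CNN over log-Mel spectrograms, and the text vector a BERT summary embedding; the sketch only fixes the merge topology.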
Many past studies have sought to determine the factors that affect f0 declination, and the physiological underpinnings of the phenomenon. This study assessed the relation between respiration and f0 declination by means of simultaneous acoustic and respiratory recordings from read and spontaneous speech from speakers of German. Within the respective
We examined how well prosodic boundary strength can be captured by two declination stylization methods and by four different representations of pitch register. In the stylization proposed by Lieberman et al. (1985), base- and topline are fitted to peaks and valleys of the pitch contour, whereas in Reichel & Mády (2013) these lines are fitted to medians below and above certain pitch percentiles. From each of the stylizations, four feature pools were induced representing different aspects of register discontinuity at word boundaries: discontinuities related to the base-, mid-, and topline, as well as to the range between base- and topline. Concerning stylization, the median-based fitting approach turned out to be more robust with respect to declination line crossing errors and yielded base-, topline-, and range-related discontinuity characteristics with higher correlations to perceived boundary strength. Concerning register representation, for the peak/valley fitting approach the base- and topline patterns showed weaker correspondences to boundary strength than the other feature pools. We furthermore trained generalized linear regression models for boundary strength prediction on each feature pool. It turned out that neither the stylization method nor the register representation had a significant influence on the overall good prediction performance.
With the COVID-19 pandemic, several research teams have reported successful advances in automated recognition of COVID-19 from voice. The resulting voice-based screening tools for COVID-19 could support large-scale testing efforts. While the capabilities of machines on this task are progressing, we address the so-far unexplored question of whether human raters can distinguish speakers who tested positive for COVID-19 from those who tested negative on the basis of voice samples, and compare their performance to a machine learning baseline. To account for the challenging symptom similarity between COVID-19 and other respiratory diseases, we use a carefully balanced dataset of voice samples in which COVID-19 positive and negative tested speakers are matched by their symptoms, alongside COVID-19 negative speakers without symptoms. Both the human raters and the machine struggle to reliably identify COVID-19 positive speakers in our dataset. These results indicate that particular attention should be paid to the distribution of symptoms across all speakers of a dataset when assessing the capabilities of existing systems. Identifying the acoustic manifestations of COVID-19-related symptoms might be the key to reliable voice-based COVID-19 detection in the future, by both trained human raters and machine learning models.
In Hungarian intonation research, the common framework developed by Varga (2002; [1]) aims to categorize intonation within the domain of accent groups by character contours. We propose a linear parameterization of a subset of these contours derived from polynomial stylization. These parameters were used to train classification trees and support vector machines for contour prediction. Parameter extraction and training were carried out on the original F0 contours of spontaneous speech data as well as on three differently normalized variants suppressing fundamental frequency level and range effects. The highest accuracies were obtained for classification trees and F0 residuals after midline subtraction, but the overall performance was rather poor. Nevertheless, a significant improvement of the results was achieved by applying a hidden Markov model to predict the correct label sequence from the partly erroneous classification output.
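The polynomial stylization step amounts to an ordinary least-squares fit of a low-order polynomial to the F0 samples of an accent group; the resulting coefficients (level, slope, curvature) can then serve as linear contour parameters. A stdlib-only sketch follows, solving the normal equations by Gaussian elimination; the polynomial order and solver are illustrative choices, not the authors' exact implementation.

```python
def polyfit(xs, ys, deg=2):
    """Least-squares polynomial fit via normal equations.
    Returns coefficients [c0, c1, ..., c_deg] for
    y = c0 + c1*x + ... + c_deg*x**deg."""
    n = deg + 1
    # Normal equations A c = b from the power sums of xs.
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Forward elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coefs = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * coefs[j] for j in range(i + 1, n))
        coefs[i] = (b[i] - s) / A[i][i]
    return coefs
```

Applied to the F0 contour of one accent group (with time normalized to, say, [0, 1]), `c0` captures register level, `c1` the overall rise or fall, and `c2` the contour's curvature; these are the kind of linear parameters a contour classifier could be trained on.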