Highlights
- New software paradigm for linguistic/phonetic tools: webservices
- Webservices encapsulating basic processing tools
- Webservices as building blocks for complex systems
- Web interface as front end to webservices or systems of webservices
- BAS CLARIN webservices: a free service to the scientific community
- Multilingual automatic segmentation and labelling of speech into words and phones
- Multilingual automatic text-to-phoneme conversion webservice
- Multilingual syllabification webservice
- Free German speech synthesis webservice
The services include automatic segmentation of speech, grapheme-to-phoneme conversion, syllabification, speech synthesis, and optimal symbol sequence alignment.
Abstract. This article presents a modular, distributed and scalable many-camera system designed to track multiple people simultaneously in a natural human-robot interaction scenario set in an apartment mock-up. The described system employs 40 high-resolution cameras networked to 15 computers, redundantly covering an area of approximately 100 square meters. The unique scale and set-up of the system require novel approaches to vision-based tracking, especially with respect to the transfer of targets between the different tracking processes while preserving the target identities. We propose an integrated approach to cope with these challenges, and focus on the system architecture, the target information management, the calibration of the cameras and the applied tracking methodologies.
This study contributes to linking the abstract phonological level to the acoustic signal level by identifying the main acoustic correlates of the distinctive feature set developed by Chomsky and Halle (1968). The acoustic features were extracted with the openSMILE toolkit from spontaneous speech data. For each distinctive feature, a set of closely related acoustic features was derived by means of correlation-based feature selection. Based on the respective acoustic feature pools, C4.5 trees and support vector machines were trained for binary feature classification. The classification performance ranged from 76 to 89% for vocalic features and from 78 to 93% for consonantal features. The methods proposed in this study can be used to identify systematic speech signal correspondences for phonological models and as a starting point for distinctive feature detection in speech recognition.
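The abstract does not state which variant of correlation-based feature selection was used. As a rough illustration only, the sketch below assumes the widely used merit heuristic k·r_cf / sqrt(k + k(k−1)·r_ff) with absolute Pearson correlations and a greedy forward search. The function names and the toy data (one binary distinctive feature, three acoustic features) are hypothetical and not taken from the study.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0


def cfs_merit(subset, columns, target):
    """Merit of a feature subset: high feature-target, low feature-feature correlation."""
    k = len(subset)
    r_cf = sum(abs(pearson(columns[j], target)) for j in subset) / k
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = (sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
            if pairs else 0.0)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def cfs_forward(columns, target):
    """Greedy forward search: add features while the merit keeps improving."""
    selected, best = [], 0.0
    remaining = set(range(len(columns)))
    while remaining:
        cand = max(sorted(remaining),
                   key=lambda j: cfs_merit(selected + [j], columns, target))
        merit = cfs_merit(selected + [cand], columns, target)
        if merit <= best:
            break
        selected.append(cand)
        best = merit
        remaining.discard(cand)
    return selected, best


# Hypothetical toy data: a binary distinctive feature and three acoustic features.
target = [0, 0, 0, 1, 1, 1]
columns = [
    [0.1, 0.2, 0.1, 0.9, 0.8, 1.0],  # strongly related to the target
    [0.3, 0.1, 0.2, 0.7, 0.9, 0.8],  # also related, partly redundant with the first
    [0.5, 0.1, 0.9, 0.4, 0.2, 0.8],  # essentially unrelated
]
selected, merit = cfs_forward(columns, target)
```

On this toy input the search keeps the two related features and stops before adding the unrelated one, because adding it would lower the merit.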
Many classifiers struggle when confronted with a high-dimensional feature space such as that of the data sets provided for the Interspeech ComParE challenge. This is because most features do not contribute significantly to the prediction. To alleviate this problem, we propose a feature selection method based on a Genetic Algorithm (GA) that uses an SVM as the fitness function. We show that this yields a reduced subset (1) which results in an Unweighted Average Recall (UAR) that beats the challenge baseline on the development set for the 3-class classification problem. Further, we extract an additional per-phoneme feature set, where the features are inspired by the ComParE features. On this set the same GA-based feature selection is performed, and the resulting set is used for training in isolation (2) and in combination with the aforementioned reduced challenge features (3). Five classifiers were tested on the three subsets, namely SVMs, DNNs, GBMs, RFs, and regularized regression. All classifiers achieved a UAR above the baseline on all three sets. The best performance on set (1) was achieved by an SVM using an RBF kernel and on sets (2) and (3) by a fusion of classifiers.
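The general idea of GA-based feature selection scored by UAR can be sketched as follows. This is a minimal illustration, not the authors' implementation: to stay dependency-free, a nearest-centroid classifier stands in for the SVM fitness function, and the population size, mutation rate, single-point crossover, elitist selection and synthetic 3-class data are all assumptions made for the example.

```python
import random


def uar(y_true, y_pred):
    """Unweighted Average Recall: the mean of the per-class recalls."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def centroid_uar(mask, X_tr, y_tr, X_dev, y_dev):
    """Stand-in fitness (not an SVM): UAR of a nearest-centroid classifier."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0
    centroids = {}
    for c in set(y_tr):
        rows = [x for x, y in zip(X_tr, y_tr) if y == c]
        centroids[c] = [sum(r[j] for r in rows) / len(rows) for j in cols]

    def predict(x):
        return min(centroids, key=lambda c: sum(
            (x[j] - m) ** 2 for j, m in zip(cols, centroids[c])))

    return uar(y_dev, [predict(x) for x in X_dev])


def ga_select(fitness, n_features, pop_size=20, generations=30, p_mut=0.05, seed=0):
    """Evolve binary feature masks; the better half survives unchanged (elitism)."""
    rng = random.Random(seed)
    pop = [[1] * n_features]  # include the full feature set as one individual
    pop += [[rng.randint(0, 1) for _ in range(n_features)]
            for _ in range(pop_size - 1)]
    for _ in range(generations):
        elite = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n_features)       # single-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]  # bit-flip mutation
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)


def make_toy(n_per_class, rng):
    """Synthetic 3-class data: 2 informative features followed by 6 noise features."""
    X, y = [], []
    for label in range(3):
        for _ in range(n_per_class):
            informative = [rng.gauss(label, 0.5), rng.gauss(-label, 0.5)]
            X.append(informative + [rng.gauss(0, 3.0) for _ in range(6)])
            y.append(label)
    return X, y


data_rng = random.Random(1)
X_tr, y_tr = make_toy(30, data_rng)
X_dev, y_dev = make_toy(30, data_rng)


def fit(mask):
    return centroid_uar(mask, X_tr, y_tr, X_dev, y_dev)


best_mask = ga_select(fit, n_features=8)
```

Because the full feature set is part of the initial population and the best individuals are carried over unchanged, the selected mask never scores below the all-features baseline on the data used for fitness evaluation. Scoring masks on the same development data that is later reported can be optimistic, so a separate held-out set would be advisable in practice.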
Spontaneous speech produced in sober and intoxicated conditions has been compared in information theoretic terms on the phoneme and word level to examine phonological and lexical aspects of intoxication. Word level entropy has been calculated to capture roughly the effect of alcohol on cognitive lexical creativity. Phoneme level entropy is intended to reflect heavy tongue influences on phoneme combinations. Moreover, mispronunciations have been investigated by relating canonical to realised pronunciation by means of mutual information and the Levenshtein distance. To account for the gradual nature of intoxication, examinations have been carried out regarding the offsets and slopes of linear functions mapping the blood alcohol concentration to the information theoretic variables. It turned out that male speakers compensate less for the alcohol-induced degradations with regard to lexical creativity and articulatory precision than female speakers. Furthermore, the pronunciation of male speakers generally deviates more from canonical forms.
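Two of the measures mentioned above, unigram entropy and the Levenshtein distance between canonical and realised pronunciation, can be illustrated with a short sketch. The abstract does not specify how entropy was estimated, so a plain maximum-likelihood unigram estimate is assumed here; the word list and phoneme sequences are hypothetical toy data, not taken from the corpus.

```python
import math
from collections import Counter


def entropy(tokens):
    """Shannon entropy (in bits) of the empirical unigram distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def levenshtein(a, b):
    """Edit distance between two symbol sequences with unit costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical toy data for illustration only.
words = "ich ich gehe nach hause".split()
canonical = ["h", "aU", "z", "@"]   # canonical phoneme sequence
realised = ["h", "aU", "s"]         # realised phoneme sequence

word_entropy = entropy(words)
distance = levenshtein(canonical, realised)
```

A lower word entropy indicates a more repetitive vocabulary, and a larger edit distance indicates a realisation further from the canonical form. Here the realised form differs from the canonical one by one substitution and one deletion.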