Five speech-language clinicians and five naive listeners rated the similarity of pairs of normal and dysphonic voices. Multidimensional scaling was used to determine the voice characteristics that were perceptually important for each voice set and listener group. Solution spaces were compared to determine if clinical experience affects perceptual strategies. Naive and expert listeners attended to different aspects of voice quality when judging the similarity of voices, for both normal and pathological voices. All naive listeners used similar perceptual strategies; however, individual clinicians differed substantially in the parameters they considered important when judging similarity. These differences were large enough to suggest that care must be taken when using data averaged across clinicians, because averaging obscures important aspects of an individual’s perceptual behavior.
Sixteen listeners (10 expert, 6 naive) judged the dissimilarity of pairs of voices drawn from pathological and normal populations. Separate nonmetric multidimensional scaling solutions were calculated for each listener and voice set. The correlations between individual listeners' dissimilarity ratings were low. However, scaling solutions indicated that each subject judged the voices in a reliable, meaningful way. Listeners differed more from one another in their judgments of the pathological voices (which varied widely on a number of acoustic parameters) than they did for the normal voices (which formed a much more homogeneous set acoustically). The acoustic features listeners used to judge dissimilarity were predictable from the characteristics of the stimulus sets: only parameters that showed substantial variability were perceptually salient across listeners. These results are consistent with prototype models of voice perception. They suggest that traditional means of assessing listener reliability in voice perception tasks may not be appropriate, and highlight the importance of using explicit comparisons between stimuli when studying voice quality perception.
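The scaling step described above recovers spatial coordinates for stimuli from a matrix of dissimilarity judgments. The studies used nonmetric multidimensional scaling; the sketch below implements the simpler classical (metric) variant, which illustrates the same idea, on an invented dissimilarity matrix for four hypothetical "voices" forming two similar pairs.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS: embed n items in k dimensions from an
    n x n dissimilarity matrix. (The cited studies used the nonmetric
    variant; this metric version is the simplest relative.)"""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered squared dissimilarities
    w, V = np.linalg.eigh(B)                 # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]            # take the k largest
    scale = np.sqrt(np.maximum(w[idx], 0.0))
    return V[:, idx] * scale                 # n x k coordinates

# Invented judgments: voices 0/1 and 2/3 sound alike; the pairs sound far apart.
D = np.array([[0, 1, 8, 8],
              [1, 0, 8, 8],
              [8, 8, 0, 1],
              [8, 8, 1, 0]], dtype=float)
X = classical_mds(D, k=2)
```

In the recovered two-dimensional solution space, the two tight pairs land close together and far from each other, mirroring how a listener's perceptual dimensions are read off the configuration.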
SRI International’s EduSpeak® system is a software development toolkit that enables developers of interactive language education software to use state-of-the-art speech recognition and pronunciation scoring technology. Automatic pronunciation scoring allows the computer to provide feedback on the overall quality of pronunciation and to point to specific production problems. We review our approach to pronunciation scoring, where our aim is to estimate the grade that a human expert would assign to the pronunciation quality of a paragraph or a phrase. Using databases of nonnative speech and corresponding human ratings at the sentence level, we evaluate different machine scores that can be used as predictor variables to estimate pronunciation quality. For more specific feedback on pronunciation, the EduSpeak toolkit supports a phone-level mispronunciation detection functionality that automatically flags specific phone segments that have been mispronounced. Phone-level information makes it possible to provide the student with feedback about specific pronunciation mistakes. Two approaches to mispronunciation detection were evaluated in a phonetically transcribed database of 130,000 phones uttered in continuous speech sentences by 206 nonnative speakers. Results show that classification error of the best system, for the phones that can be reliably transcribed, is only slightly higher than the average pairwise disagreement between the human transcribers.
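The abstract does not disclose EduSpeak's exact scoring method, but a common family of phone-level machine scores works like the goodness-of-pronunciation (GOP) measure: compare the recognizer's likelihood for the intended phone against the best competing phone over that phone's frames, and flag the segment when the gap is large. The sketch below uses invented frame scores and a hypothetical threshold purely for illustration.

```python
def gop_score(target_loglikes, best_loglikes):
    """GOP-style score for one phone segment: average per-frame log-likelihood
    of the intended phone minus that of the best-matching competitor.
    Near zero means the intended phone fits best; strongly negative
    suggests a mispronunciation. (A common approach, not EduSpeak's
    proprietary scoring.)"""
    n = len(target_loglikes)
    return sum(t - b for t, b in zip(target_loglikes, best_loglikes)) / n

def flag_mispronounced(score, threshold=-1.0):
    """Hypothetical decision rule: flag the phone if its score falls
    below a tuned threshold."""
    return score < threshold

# Invented frame log-likelihoods for two phone segments:
good = gop_score([-2.0, -2.1, -1.9], [-2.0, -2.1, -1.9])  # target is the best match
bad = gop_score([-6.0, -5.5, -6.2], [-2.2, -2.0, -2.4])   # a competitor fits much better
```

In a real system the threshold would be tuned per phone against transcriber judgments, which is exactly where the human-transcriber agreement figures quoted above become the ceiling on achievable accuracy.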
UPSID—the UCLA phonological segment inventory database—is a database containing the phoneme inventories of a large genetically based sample of languages [I. Maddieson, Patterns of Sounds (1984)]. Each phoneme is specified in terms of a comprehensive set of phonetic features. The first version of the database has proven useful to scholars interested in phonological universals and theories concerning the structure of phonological systems [e.g., B. Lindblom and I. Maddieson, in Language, Speech & Mind, edited by L. M. Hyman and C. N. Li (1988); K. Stevens and S. J. Keyser, Language 65, 81–106 (1989)]. An expanded and corrected second version is currently in preparation. This version improves the sample by increasing coverage of previously undersampled language families and correcting a few oversampling errors; it also corrects errors in individual language inventories. A new custom-written software package for MS-DOS systems provides economical and flexible means of storing and modifying this enhanced database and outputting subsets of the data for further analysis. The database is stored as several separate but interrelated modules. One contains a listing of character codes for each distinct segment type occurring in the database paired with a standard phonetic description and with the list of features assigned to that segment. Another contains the phoneme inventories as a set of character codes for each language. The database is principally used by combining information from these two modules, for example, by creating a file containing the fully specified feature descriptions of the segments in a group of languages, or all segments defined by a selected set of feature values. Such files can be exported to a standard statistics package for sophisticated processing, but simpler counting operations can be performed within the UPSID program.
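The two-module design described above—one table mapping segment codes to feature sets, another mapping languages to their segment codes—supports queries by joining the two. The toy sketch below illustrates that query pattern; the segment codes, feature names, and language names are invented examples, not actual UPSID data.

```python
# Module 1 (invented sample): segment code -> set of phonetic features.
SEGMENT_FEATURES = {
    "p": {"voiceless", "bilabial", "plosive"},
    "b": {"voiced", "bilabial", "plosive"},
    "m": {"voiced", "bilabial", "nasal"},
    "s": {"voiceless", "alveolar", "fricative"},
}

# Module 2 (invented sample): language -> set of segment codes in its inventory.
INVENTORIES = {
    "LangA": {"p", "b", "m"},
    "LangB": {"p", "m", "s"},
}

def segments_with(features, inventories=INVENTORIES):
    """For each language, list the inventory segments bearing all the
    requested feature values -- the 'selected set of feature values'
    query described in the abstract."""
    return {lang: {s for s in segs if features <= SEGMENT_FEATURES[s]}
            for lang, segs in inventories.items()}

hits = segments_with({"voiceless"})
```

The resulting per-language segment lists are exactly the kind of subset file the abstract describes exporting to a statistics package, while simple counts (e.g., how many languages have a voiceless plosive) can be computed directly.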
We present supervised approaches for detecting speaker roles and agreement/disagreement between speakers in broadcast conversation shows in three languages: English, Arabic, and Mandarin. We develop annotation approaches for a variety of linguistic phenomena. Various lexical, structural, and social network analysis based features are explored, and feature importance is analyzed across the three languages. We also compare the performance when using features extracted from automatically generated annotations against that when using human annotations. The algorithms achieve speaker role labeling accuracy of more than 86% for all three languages. For agreement and disagreement detection, the algorithms achieve precision of 63% to 92% and 55% to 85%, respectively, across the three languages.
Index Terms: speaker role labeling, agreement and disagreement, broadcast conversation, feature analysis
Introduction. In recent years, much research has aimed at developing systems for automatically analyzing the large volume of broadcast speech (for example, the recent DARPA GALE program). Under GALE, broadcast news (BN) and broadcast conversation (BC) audio was collected. The BN genre consists of "talking head" style broadcasts, i.e., generally one person reading a news script. The BC genre is more interactive and spontaneous, referring to free-flowing speech in news-style TV and radio programs and consisting of talk shows, interviews, call-in programs, live reports, and round-tables. In past years, systems have been built to perform automatic transcription, speaker diarization, story segmentation, summarization, etc., especially for the BN data. Speaker roles can provide useful structural information about broadcast audio data for applications such as spoken document retrieval, summarization, or question answering. The initial work on speaker role classification focused only on the BN data.
In [1], speakers in BN shows were categorized into three categories: anchor, journalist, and guest, and 80% classification accuracy was achieved on English BN automatic speech recognition (ASR) transcriptions. Liu [2] developed an HMM-based approach and a maximum entropy model for three-way speaker role labeling (i.e., anchor, reporter, or other) using Mandarin BN speech. The algorithm achieved classification accuracy of about 80% using the human transcriptions and manually labeled speaker turns. Recently, Hutchinson et al. [3] studied speaker role labeling in English and Mandarin BC audio data as an unsupervised learning task, in order to avoid the cost associated with manual annotations and to explore the large amount of unlabeled BC data. In this study, speakers were classified into three roles: hosts, expert guests (e.g., journalists, panelists, interviewees), and sound bites... (The goal of the GALE program is to develop computer software techniques to analyze, interpret, and distill information from speech and text in multiple languages.)
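The lexical features mentioned above can drive a simple supervised role classifier: anchors tend to use formulaic show-management language ("welcome back", "coming up"), while guests use opinion language. The sketch below is a minimal naive Bayes classifier over words, a stand-in for the maximum entropy and HMM models cited; the training snippets and role labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict

# Invented training turns labeled with speaker roles.
TRAIN = [
    ("anchor", "welcome back to the program our top story tonight"),
    ("anchor", "thanks for joining us coming up after the break"),
    ("guest", "well i think the policy is a mistake frankly"),
    ("guest", "in my experience that is simply not how it works"),
]

def train(examples):
    """Count role priors and per-role word frequencies."""
    priors, word_counts, totals = Counter(), defaultdict(Counter), Counter()
    for role, text in examples:
        priors[role] += 1
        for w in text.split():
            word_counts[role][w] += 1
            totals[role] += 1
    vocab = {w for c in word_counts.values() for w in c}
    return priors, word_counts, totals, vocab

def classify(text, model):
    """Pick the role maximizing the add-one-smoothed log posterior."""
    priors, word_counts, totals, vocab = model
    n = sum(priors.values())
    best, best_lp = None, -math.inf
    for role in priors:
        lp = math.log(priors[role] / n)
        for w in text.split():
            lp += math.log((word_counts[role][w] + 1) / (totals[role] + len(vocab)))
        if lp > best_lp:
            best, best_lp = role, lp
    return best

model = train(TRAIN)
```

A real system would add the structural and social-network features the paper describes (turn length, turn position, who addresses whom) rather than relying on words alone.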
Bolinger, Ohala, Morton and others have established that vocal pitch height is perceived to be associated with social signals of dominance and submissiveness: higher vocal pitch is associated with submissiveness, whereas lower vocal pitch is associated with social dominance. An experiment was carried out to test this relationship in the perception of non-vocal melodies. Results show a parallel situation in music: higher-pitched melodies sound more submissive (less threatening) than lower-pitched melodies.