This paper reports the results of our experiments on speaker identification in the SCOTUS corpus, which includes oral arguments from the Supreme Court of the United States. Our main findings are as follows: 1) a combination of Gaussian mixture models and monophone HMM models attains near-100% text-independent identification accuracy on utterances longer than one second; 2) a sampling rate of 11025 Hz achieves the best performance (higher sampling rates are harmful), and a sampling rate as low as 2000 Hz still achieves more than 90% accuracy; 3) a distance score based on likelihood values was used to measure the variability of phones among speakers; we found that the most variable phone is UH (as in "good"), and that the velar nasal NG is more variable than the other two nasal sounds, M and N; 4) our models achieved "perfect" forced alignment on very long speech segments (40 minutes). These findings and their significance are discussed.
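The enrollment-and-scoring scheme behind GMM-based speaker identification can be sketched as follows. This is a minimal illustration using scikit-learn with random stand-in features, not the system described above: a real pipeline would score MFCC frames extracted from audio, and the HMM component is omitted entirely.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy stand-ins for per-speaker feature matrices (frames x coefficients);
# a real system would use MFCCs extracted from enrollment audio.
rng = np.random.default_rng(0)
train = {
    "speaker_a": rng.normal(0.0, 1.0, size=(500, 13)),
    "speaker_b": rng.normal(2.0, 1.0, size=(500, 13)),
}

# Enrollment: fit one GMM per speaker.
models = {
    spk: GaussianMixture(n_components=4, random_state=0).fit(feats)
    for spk, feats in train.items()
}

def identify(utterance_feats, models):
    """Return the enrolled speaker whose GMM assigns the highest
    average log-likelihood to the test utterance's frames."""
    scores = {spk: gmm.score(utterance_feats) for spk, gmm in models.items()}
    return max(scores, key=scores.get)

test_utt = rng.normal(2.0, 1.0, size=(100, 13))  # drawn like speaker_b
print(identify(test_utt, models))
```

Because identification picks the argmax over per-speaker likelihoods, accuracy improves with utterance length: more frames sharpen the average log-likelihood estimate, consistent with the one-second threshold reported above.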
We report the status of the UNIPEN project of data
'Linguistic annotation' covers any descriptive or analytic notations applied to raw language data. The basic data may be in the form of time functions (audio, video, and/or physiological recordings) or it may be textual. The added notations may include transcriptions of all sorts (from phonetic features to discourse structures), part-of-speech and sense tagging, syntactic analysis, 'named entity' identification, co-reference annotation, and so on. While there are several ongoing efforts to provide formats and tools for such annotations and to publish annotated linguistic databases, the lack of widely accepted standards is becoming a critical problem. Proposed standards, to the extent they exist, have focussed on file formats. This paper focuses instead on the logical structure of linguistic annotations. We survey a wide variety of existing annotation formats and demonstrate a common conceptual core, the annotation graph. This provides a formal framework for constructing, maintaining and searching linguistic annotations, while remaining consistent with many alternative data structures and file formats.
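The annotation-graph idea (nodes carrying optional time offsets, with labeled arcs spanning pairs of nodes) can be made concrete with a small sketch. The class and field names below are illustrative, not the paper's formal notation.

```python
from dataclasses import dataclass, field

@dataclass
class Arc:
    start: int   # source node id
    end: int     # target node id
    tier: str    # annotation type, e.g. "word", "phone", "pos"
    label: str

@dataclass
class AnnotationGraph:
    # node id -> time offset in seconds; may be partial, since the
    # framework allows nodes without anchored times.
    times: dict = field(default_factory=dict)
    arcs: list = field(default_factory=list)

    def add(self, start, end, tier, label):
        self.arcs.append(Arc(start, end, tier, label))

    def query(self, tier):
        """All arcs of a given annotation type."""
        return [a for a in self.arcs if a.tier == tier]

g = AnnotationGraph(times={0: 0.00, 1: 0.32, 2: 0.81})
g.add(0, 1, "word", "hello")
g.add(1, 2, "word", "world")
g.add(0, 1, "pos", "UH")
print([a.label for a in g.query("word")])
```

Because annotations of different types are just arcs over a shared node set, transcriptions, tags, and syntactic spans coexist in one structure, which is what lets the graph stay consistent with many alternative file formats.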
This paper introduces the second DIHARD challenge, the latest in a series of speaker diarization challenges intended to improve the robustness of diarization systems to variation in recording equipment, noise conditions, and conversational domain. The challenge comprises four tracks evaluating diarization performance under two input conditions (single channel vs. multi-channel) and two segmentation conditions (diarization from a reference speech segmentation vs. diarization from scratch). In order to prevent participants from overtuning to a particular combination of recording conditions and conversational domain, recordings are drawn from a variety of sources, ranging from read audiobooks to meeting speech, child language acquisition recordings, dinner parties, and web video. We describe the task and metrics, challenge design, datasets, and baseline systems for speech enhancement, speech activity detection, and diarization. [1] See, for instance, the release of IBM's diarization API in 2017. The feature worked well for simple cases, but when run by users on real inputs, the performance was found to be lacking, especially for overlaps, back-channels, and short turns.
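Diarization output in challenges of this kind is conventionally scored with diarization error rate (DER): the sum of missed speech, false-alarm speech, and speaker-confusion time over total reference speech. A simplified frame-level version, which omits the forgiveness collar and optimal speaker-label mapping of official scoring tools, might look like:

```python
def frame_der(reference, hypothesis):
    """Simplified frame-level diarization error rate (DER).

    `reference` and `hypothesis` are equal-length per-frame lists of sets
    of active speaker labels (empty set = non-speech). Labels are assumed
    to already share a namespace; official scorers additionally search for
    the optimal label mapping and apply a collar, both omitted here.
    """
    missed = false_alarm = confusion = total_speech = 0
    for ref, hyp in zip(reference, hypothesis):
        total_speech += len(ref)
        missed += max(len(ref) - len(hyp), 0)        # speakers left unaccounted for
        false_alarm += max(len(hyp) - len(ref), 0)   # extra hypothesized speakers
        confusion += min(len(ref), len(hyp)) - len(ref & hyp)  # wrong labels
    return (missed + false_alarm + confusion) / total_speech

ref = [{"A"}, {"A"}, {"A", "B"}, {"B"}, set()]
hyp = [{"A"}, {"B"}, {"A"}, {"B"}, {"B"}]
print(frame_der(ref, hyp))  # (1 miss + 1 false alarm + 1 confusion) / 5 = 0.6
```

Note that overlapped frames (like the third one above) contribute multiple units of reference speech, which is why overlaps, back-channels, and short turns weigh so heavily in real-world performance.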
Background: Autism spectrum disorder (ASD) is diagnosed more frequently in boys than girls, even when girls are equally symptomatic. Cutting-edge behavioral imaging has detected “camouflaging” in girls with ASD, wherein social behaviors appear superficially typical, complicating diagnosis. The present study explores a new kind of camouflage based on language differences. Pauses during conversation can be filled with words like UM or UH, but research suggests that these two words are pragmatically distinct (e.g., UM is used to signal longer pauses, and may correlate with greater social communicative sophistication than UH). Large-scale research suggests that women and younger people produce higher rates of UM during conversational pauses than do men and older people, who produce relatively more UH. Although it has been argued that children and adolescents with ASD use UM less often than typical peers, prior research has not included sufficient numbers of girls to examine whether sex explains this effect. Here, we explore UM vs. UH in school-aged boys and girls with ASD, and ask whether filled pauses relate to dimensional measures of autism symptom severity. Methods: Sixty-five verbal school-aged participants with ASD (49 boys, 16 girls, IQ estimates in the average range) participated, along with a small comparison group of typically developing children (8 boys, 9 girls). Speech samples from the Autism Diagnostic Observation Schedule were orthographically transcribed and time-aligned, with filled pauses marked. Parents completed the Social Communication Questionnaire and the Vineland Adaptive Behavior Scales. Results: Girls used UH less often than boys across both diagnostic groups. UH suppression resulted in higher UM ratios for girls than boys, and overall filled pause rates were higher for typical children than for children with ASD. 
Higher UM ratios correlated with better socialization in boys with ASD, but this effect was driven by increased use of UH by boys with greater symptoms. Conclusions: Pragmatic language markers distinguish girls and boys with ASD, mirroring sex differences in the general population. One implication of this finding is that typical-sounding disfluency patterns (i.e., reduced relative UH production leading to higher UM ratios) may normalize the way girls with ASD sound relative to other children, serving as “linguistic camouflage” for a naïve listener and distinguishing them from boys with ASD. This first-of-its-kind study highlights the importance of continued commitment to understanding how sex and gender change the way that ASD manifests, and illustrates the potential of natural language to contribute to objective “behavioral imaging” diagnostics for ASD.
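The UM ratio discussed above is, in its simplest formulation, the proportion of filled pauses realized as UM rather than UH. A minimal version over a tokenized transcript is sketched below; the study's actual transcription and counting protocol may differ.

```python
def um_ratio(tokens):
    """Proportion of filled pauses realized as UM: um / (um + uh).

    `tokens` is a list of orthographic words; returns None when the
    sample contains no filled pauses at all.
    """
    um = sum(1 for t in tokens if t.lower() == "um")
    uh = sum(1 for t in tokens if t.lower() == "uh")
    return um / (um + uh) if (um + uh) else None

sample = "so um I think uh it was um like that".split()
print(um_ratio(sample))  # 2 UM, 1 UH -> 0.666...
```

Defining the measure as a ratio rather than a raw rate is what makes "UH suppression" visible: a speaker who produces fewer UHs shows a higher UM ratio even if her UM count is unchanged.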
Objective: To automatically extract and quantify specific disease biomarkers of prosody from the acoustic properties of speech in patients with primary progressive aphasia. Methods: We analyzed speech samples from 59 progressive aphasic patients (non‐fluent/agrammatic = 15, semantic = 21, logopenic = 23; ages 50–85 years) and 31 matched healthy controls (ages 54–89 years). Using a novel, automated speech analysis protocol, we extracted acoustic measurements of prosody, including fundamental frequency and speech and silent pause durations, and compared these between groups. We then examined their relationships with clinical tests, gray matter atrophy, and cerebrospinal fluid analytes. Results: We found a narrowed range of fundamental frequency in patients with non‐fluent/agrammatic variant aphasia (mean 3.86 ± 1.15 semitones) compared with healthy controls (6.06 ± 1.95 semitones; P < 0.001) and patients with semantic variant aphasia (6.12 ± 1.77 semitones; P = 0.001). Mean pause rate was significantly increased in the non‐fluent/agrammatic group (mean 61.4 ± 20.8 pauses per minute) and the logopenic group (58.7 ± 16.4 pauses per minute) compared to controls. In an exploratory analysis, narrowed fundamental frequency range was associated with atrophy in the left inferior frontal cortex. Cerebrospinal level of phosphorylated tau was associated with an acoustic classifier combining fundamental frequency range and pause rate (r = 0.58, P = 0.007). Receiver operating characteristic analysis with this combined classifier distinguished non‐fluent/agrammatic speakers from healthy controls (AUC = 0.94) and from semantic variant patients (AUC = 0.86). Interpretation: Restricted fundamental frequency range and increased pause rate are characteristic markers of speech in non‐fluent/agrammatic primary progressive aphasia. These can be extracted with automated speech analysis and are associated with left inferior frontal atrophy and cerebrospinal phosphorylated tau level.
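Both acoustic markers are simple to compute once an f0 track and pause boundaries are available. The semitone conversion below is the conventional one (12 log2 of the frequency ratio); the raw min/max range and the made-up input values are simplifications for illustration, not the study's protocol, which would typically use robust percentiles of voiced frames.

```python
import math

def f0_range_semitones(f0_hz):
    """Fundamental-frequency range expressed in semitones.

    `f0_hz` is a sequence of voiced-frame f0 estimates in Hz. A semitone
    is 1/12 of an octave, so the range is 12 * log2(high / low).
    """
    lo, hi = min(f0_hz), max(f0_hz)
    return 12 * math.log2(hi / lo)

def pause_rate_per_minute(n_pauses, duration_seconds):
    """Silent-pause count normalized to pauses per minute."""
    return n_pauses * 60.0 / duration_seconds

print(f0_range_semitones([100, 120, 150, 200]))  # one octave = 12.0 semitones
print(pause_rate_per_minute(45, 44.0))           # ~61.4 pauses per minute
```

Expressing the f0 range in semitones rather than Hz makes the measure comparable across speakers with different baseline pitch, which is essential when pooling male and female voices in group comparisons like those above.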
In this study, we investigate cross-linguistic patterns in the alternation between UM, a hesitation marker consisting of a neutral vowel followed by a final labial nasal, and UH, a hesitation marker consisting of a neutral vowel in an open syllable. Based on a quantitative analysis of a range of spoken and written corpora, we identify clear and consistent patterns of change in the use of these forms in various Germanic languages (English, Dutch, German, Norwegian, Danish, Faroese) and dialects (American English, British English), with the use of UM increasing over time relative to the use of UH. We also find that this pattern of change is generally led by women and more educated speakers. Finally, we propose a series of possible explanations for this surprising change in hesitation marker usage that is currently taking place across Germanic languages.