This paper presents a large-scale study of subglottal resonances (SGRs) (the resonant frequencies of the tracheo-bronchial tree) and their relations to various acoustical and physiological characteristics of speakers. The paper presents data from a corpus of simultaneous microphone and accelerometer recordings of consonant-vowel-consonant (CVC) words embedded in a carrier phrase spoken by 25 male and 25 female native speakers of American English ranging in age from 18 to 24 yr. The corpus contains 17 500 utterances of 14 American English monophthongs, diphthongs, and the rhotic approximant [ɹ] in various CVC contexts. Only monophthongs are analyzed in this paper. Speaker height and age were also recorded. Findings include (1) normative data on the frequency distribution of SGRs for young adults, (2) the dependence of SGRs on height, (3) the lack of a correlation between SGRs and formants or the fundamental frequency, (4) a poor correlation of the first SGR with the second and third SGRs but a strong correlation between the second and third SGRs, and (5) a significant effect of vowel category on SGR frequencies, although this effect is smaller than the measurement standard deviations and therefore negligible for practical purposes.
Previous studies of subglottal resonances have reported findings based on relatively few subjects, and the relations between these resonances, subglottal anatomy, and models of subglottal acoustics are not well understood. In this study, accelerometer signals of subglottal acoustics recorded during sustained [a:] vowels of 50 adult native speakers (25 males, 25 females) of American English were analyzed. The study confirms that a simple uniform tube model of subglottal airways, closed at the glottis and open at the inferior end, is appropriate for describing subglottal resonances. The main findings of the study are (1) whereas the walls may be considered rigid in the frequency range of Sg2 and Sg3, they are yielding and resonant in the frequency range of Sg1, with a resulting ~4/3 increase in wave propagation velocity and, consequently, in the frequency of Sg1; (2) the "acoustic length" of the equivalent uniform tube varies between 18 and 23.5 cm, and is approximately equal to the height of the speaker divided by an empirically determined scaling factor; (3) trachea length can also be predicted by dividing height by another empirically determined scaling factor; and (4) differences between the subglottal resonances of males and females can be accounted for by height-related differences.
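The uniform-tube model described above, closed at the glottis and open at the inferior end, is a quarter-wave resonator, so its resonances fall at odd multiples of c/(4L). The following is a minimal sketch of that relation, with the reported ~4/3 wave-velocity increase for Sg1 applied as a simple multiplier; the speed-of-sound value and the example length are illustrative assumptions, not values taken from the study.

```python
C_AIR = 35000.0  # approximate speed of sound in warm, humid air, cm/s (assumed)

def subglottal_resonances(acoustic_length_cm, n_resonances=3,
                          sg1_velocity_factor=4.0 / 3.0):
    """Quarter-wave resonances of a closed-open uniform tube.

    acoustic_length_cm: equivalent "acoustic length" of the tube
    (the study reports values between 18 and 23.5 cm for adults).
    """
    resonances = []
    for n in range(1, n_resonances + 1):
        # Closed-open tube: f_n = (2n - 1) * c / (4L)
        f = (2 * n - 1) * C_AIR / (4.0 * acoustic_length_cm)
        if n == 1:
            # Yielding, resonant walls raise the propagation velocity,
            # and hence Sg1, by roughly 4/3 in the Sg1 frequency range.
            f *= sg1_velocity_factor
        resonances.append(f)
    return resonances

# Example: a hypothetical 21 cm equivalent tube.
sg1, sg2, sg3 = subglottal_resonances(21.0)
```

With these assumed values the sketch yields resonances in the broad ranges typically reported for adult SGRs, but it is only a first-order model; the wall-yielding correction in particular is frequency dependent in reality, not a fixed factor.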
This paper offers a re-evaluation of the mechanical properties of the tracheo-bronchial soft tissues and cartilage and uses a model to examine their effects on the subglottal acoustic input impedance. It is shown that the values for soft tissue elastance and cartilage viscosity typically used in models of subglottal acoustics during phonation are not accurate, and corrected values are proposed. The calculated subglottal acoustic input impedance using these corrected values reveals clusters of weak resonances due to soft tissues (SgT) and cartilage (SgC) lining the walls of the trachea and large bronchi, which can be observed empirically in subglottal acoustic spectra. The model predicts that individuals may exhibit SgT and SgC resonances to variable degrees, depending on a number of factors including tissue mechanical properties and the dimensions of the trachea and large bronchi. Potential implications for voice production and large pulmonary airway tissue diseases are also discussed.
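A common way to represent yielding walls in such models is as a lumped mass-resistance-elastance element per unit area, whose reactance vanishes at the wall resonance. The sketch below shows this generic form only; the parameter values are placeholders, not the corrected soft-tissue or cartilage values proposed in the paper.

```python
import math

def wall_impedance(freq_hz, mass, resistance, elastance):
    """Mechanical impedance per unit area of a yielding wall, modeled
    as a series mass-resistance-elastance (M-R-K) element:
        Z(w) = R + j*(w*M - K/w),  w = 2*pi*f
    All parameter values passed in are illustrative assumptions.
    """
    w = 2.0 * math.pi * freq_hz
    return complex(resistance, w * mass - elastance / w)

def wall_resonance_hz(mass, elastance):
    """The wall element resonates where its reactance vanishes:
    w0 = sqrt(K/M)."""
    return math.sqrt(elastance / mass) / (2.0 * math.pi)
```

At the wall resonance the impedance reduces to its resistive part, which is why a cluster of weak wall-driven resonances (such as the SgT and SgC clusters discussed above) can appear superimposed on the airway's acoustic input impedance.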
Multilingual ASR technology simplifies model training and deployment, but its accuracy is known to depend on the availability of language information at runtime. Since language identity is seldom known beforehand in real-world scenarios, it must be inferred on-the-fly with minimum latency. Furthermore, in voice-activated smart assistant systems, language identity is also required for downstream processing of ASR output. In this paper, we introduce streaming, end-to-end, bilingual systems that perform both ASR and language identification (LID) using the recurrent neural network transducer (RNN-T) architecture. On the input side, embeddings from pretrained acoustic-only LID classifiers are used to guide RNN-T training and inference, while on the output side, language targets are jointly modeled with ASR targets. The proposed method is applied to two language pairs: English-Spanish as spoken in the United States, and English-Hindi as spoken in India. Experiments show that for English-Spanish, the bilingual joint ASR-LID architecture matches monolingual ASR and acoustic-only LID accuracies. For the more challenging (owing to within-utterance code switching) case of English-Hindi, English ASR and LID metrics show degradation. Overall, in scenarios where users switch dynamically between languages, the proposed architecture offers a promising simplification over running multiple monolingual ASR models and an LID classifier in parallel.
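One simple way to jointly model language and ASR targets, in the spirit of the output-side scheme described above, is to augment each utterance's token sequence with a language tag so the transducer predicts both. The tag vocabulary and the append/prepend choice below are assumptions for illustration, not necessarily the paper's exact scheme.

```python
# Hypothetical language-tag tokens added to the output vocabulary.
LANG_TAGS = {"en": "<en>", "es": "<es>", "hi": "<hi>"}

def make_joint_targets(transcript_tokens, lang_id, position="append"):
    """Return ASR targets augmented with a per-utterance language tag.

    transcript_tokens: list of output tokens (e.g., wordpieces).
    lang_id: key into LANG_TAGS (an assumed tag set).
    position: "append" or "prepend" the tag relative to the transcript.
    """
    tag = LANG_TAGS[lang_id]
    if position == "prepend":
        return [tag] + list(transcript_tokens)
    return list(transcript_tokens) + [tag]

targets = make_joint_targets(["hola", "mundo"], "es")
```

Under a scheme like this, the LID decision is read off the decoded tag, while per-utterance tagging leaves within-utterance code switching (the harder English-Hindi case above) only coarsely modeled.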