We present the Montreal Forced Aligner (MFA), a new open-source system for speech-text alignment. MFA is an update to the Prosodylab-Aligner: it maintains the key functionality of trainability on new data and adds an improved architecture (triphone acoustic models and speaker adaptation) along with other features. MFA uses Kaldi instead of HTK, which allows it to be distributed as a stand-alone package and to exploit parallel processing for computationally intensive training and for scaling to larger datasets. We evaluate MFA's performance on aligning word and phone boundaries in English conversational and laboratory speech, relative to human-annotated boundaries, focusing on the effects of aligner architecture and of training on the data to be aligned. MFA performs well relative to two existing open-source aligners with simpler architecture (Prosodylab-Aligner and FAVE), and both its improved architecture and training on the data to be aligned generally result in more accurate boundaries.
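The evaluation described above compares aligner-produced boundaries with human-annotated ones. A minimal sketch of such a comparison follows, assuming boundaries are given as times in seconds and are already paired one-to-one. The function name `boundary_errors`, the 20 ms tolerance, and the example values are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, median


def boundary_errors(predicted, gold, tolerance=0.020):
    """Compare aligner boundaries with hand-annotated ones (times in seconds).

    Assumes the two lists are already paired one-to-one
    (same segments, same order). The tolerance is a hypothetical default.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("at least one boundary is required")
    # Absolute distance between each aligned boundary and its gold counterpart.
    abs_diffs = [abs(p - g) for p, g in zip(predicted, gold)]
    return {
        "mean_abs_error": mean(abs_diffs),
        "median_abs_error": median(abs_diffs),
        # Share of boundaries that fall within the tolerance window.
        "within_tolerance": sum(d <= tolerance for d in abs_diffs) / len(abs_diffs),
    }


# Made-up boundary times, for illustration only.
gold = [0.10, 0.35, 0.80, 1.20]
predicted = [0.11, 0.34, 0.85, 1.20]
scores = boundary_errors(predicted, gold)
print(scores)
```

Reporting both a central error measure and the share of boundaries within a tolerance window gives a fuller picture than either alone; the specific metrics used in the paper may differ.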
This paper reports three studies aimed at addressing three questions about the acoustic correlates of information structure in English: (1) do speakers mark information structure prosodically, and, to the extent they do, (2) what are the acoustic features associated with different aspects of information structure, and (3) how well can listeners retrieve this information from the signal? The information structure of subject-verb-object (SVO) sentences was manipulated via the questions preceding those sentences: elements in the target sentences were either focused (i.e. the answer to a wh-question) or given (i.e. mentioned in prior discourse); furthermore, focused elements had either an implicit or an explicit contrast set in the discourse; finally, either only the object was focused (narrow object focus) or the entire event was focused (wide focus). The results across all three experiments demonstrated that people reliably mark (a) focus location (subject, verb, or object) using greater intensity, longer duration, and higher mean and maximum F0, and (b) focus breadth, such that narrow object focus is marked with greater intensity, longer duration, and higher mean and maximum F0 on the object than wide focus. Furthermore, when participants are made aware of prosodic ambiguity present across different information structures, they reliably mark focus type, so that contrastively-focused elements are produced with higher intensity, longer duration, and lower mean and maximum F0 than non-contrastively focused elements. In addition to having important theoretical consequences for accounts of semantics and prosody, these experiments demonstrate that linear residualization successfully removes individual differences in people's productions thereby revealing cross-speaker generalizations. Furthermore, discriminant modeling allows us to objectively determine the acoustic features that underlie meaning differences.
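One simple reading of the linear residualization mentioned above is to regress each acoustic measure on speaker identity and keep the residuals. With speaker as the only (categorical) predictor, the least-squares fit is each speaker's mean, so the residual is the observed value minus that mean. The sketch below makes that assumption; the paper's actual regression model may include further predictors, and the function name and F0 values are hypothetical.

```python
from collections import defaultdict


def residualize_by_speaker(values, speakers):
    """Remove per-speaker offsets from an acoustic measure.

    Assumption: speaker identity is the only predictor, so the fitted
    value for each observation is simply that speaker's mean.
    """
    if len(values) != len(speakers):
        raise ValueError("values and speakers must have the same length")
    by_speaker = defaultdict(list)
    for value, speaker in zip(values, speakers):
        by_speaker[speaker].append(value)
    # Least-squares fit with one indicator per speaker = per-speaker mean.
    fitted = {s: sum(vs) / len(vs) for s, vs in by_speaker.items()}
    # Residual = observed value minus the speaker's fitted value.
    return [v - fitted[s] for v, s in zip(values, speakers)]


# Made-up mean F0 values (Hz) for two hypothetical speakers.
f0 = [200.0, 220.0, 110.0, 130.0]
speakers = ["s1", "s1", "s2", "s2"]
residuals = residualize_by_speaker(f0, speakers)
print(residuals)  # [-10.0, 10.0, -10.0, 10.0]
```

After residualization the two speakers show the same pattern even though their raw F0 ranges differ, which is the kind of cross-speaker generalization the abstract refers to.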
Research on prosody has recently become an important focus in various disciplines, including Linguistics, Psychology, and Computer Science. This article reviews recent research advances on two key issues: prosodic phrasing and prosodic prominence. Both aspects of prosody are influenced by linguistic factors such as syntactic constituent structure, semantic relations, phonological rhythm, and pragmatic considerations, and also by processing factors such as the length, complexity, or predictability of linguistic material. Our review summarizes recent insights into the production and perception of these two components of prosody and their grammatical underpinnings. While this review covers only a subset of a broader set of research topics on prosody in cognitive science, the topics covered are representative of a tendency in the field toward a more interdisciplinary approach.
Clarification of the cortical mechanisms underlying auditory sensory gating may advance our understanding of brain dysfunctions associated with schizophrenia. To this end, data were analyzed from 9 epilepsy patients who participated in an auditory paired-click paradigm during pre-surgical evaluation and had grids of electrodes covering the temporal and frontal lobes. A distributed source localization approach was applied to the intracranial P50 response and to the Gating Difference Wave, obtained by subtracting the response to the second stimuli from the response to the first stimuli. Source reconstruction of the P50 showed that the main generators of the response were localized at the temporal lobes. The analysis also suggested that the maximum neuronal activity contributing to the amplitude reduction in the P50 time range (the phenomenon of auditory sensory gating) is localized at the frontal lobe. The present findings suggest that while the temporal lobe is the main generator of the P50 component, the frontal lobe seems to be a substantial contributor to the process of sensory gating as observed from scalp recordings.
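The Gating Difference Wave described above is a sample-by-sample subtraction of the averaged response to the second click from the averaged response to the first. A minimal sketch, assuming both responses are averaged waveforms sampled on the same time grid; the function name and amplitude values are hypothetical.

```python
def gating_difference_wave(resp_first, resp_second):
    """Subtract the response to the second click from the response to the first.

    Assumes both inputs are averaged waveforms of equal length,
    sampled at the same time points.
    """
    if len(resp_first) != len(resp_second):
        raise ValueError("both responses must have the same number of samples")
    # Positive values mark samples where the second response is reduced.
    return [a - b for a, b in zip(resp_first, resp_second)]


# Made-up amplitudes (arbitrary units) around a P50-like peak.
s1 = [0.0, 2.0, 5.0, 3.0, 1.0]
s2 = [0.0, 1.0, 2.0, 1.0, 0.0]
diff = gating_difference_wave(s1, s2)
print(diff)  # [0.0, 1.0, 3.0, 2.0, 1.0]
```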
The standardized Low Resolution Brain Electromagnetic Tomography method (sLORETA) can be used to compute statistical maps from EEG and MEG data that indicate the locations of the underlying source processes with low error. These maps are derived by performing a location-wise inverse weighting of the results of a Minimum Norm Least Squares (MNLS) analysis with their estimated variances. In this contribution, we evaluate the performance of the method in the presence of noise and with multiple, simultaneously active sources. It is shown that the sLORETA method localizes well compared to other linear approaches such as MNLS and LORETA. However, simultaneously active sources can only be separated if their fields are distinct enough and of similar strength. In the presence of a strong or superficial source, weak or deep sources remain invisible, and nearby sources of similar orientation tend not to be separated but are interpreted as one source located roughly in between.
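The location-wise inverse weighting described above can be written out as follows. This is a sketch for the simplified case of one fixed-orientation source per location. The symbols ($K$ for the lead field, $\phi$ for the measurements, $\alpha$ for the regularization parameter, $T$, $S$, and $z_{\ell}$) are notational assumptions, not taken from the abstract.

```latex
\begin{align}
  \hat{\jmath} &= T\,\phi, \qquad
  T = K^{\top}\left(K K^{\top} + \alpha I\right)^{-1}
    && \text{(regularized minimum-norm estimate)} \\
  S &= T K
    && \text{(estimated variance of } \hat{\jmath} \text{, up to scale)} \\
  z_{\ell} &= \frac{\hat{\jmath}_{\ell}}{\sqrt{S_{\ell\ell}}}
    && \text{(standardized value at location } \ell \text{)}
\end{align}
```

Dividing each minimum-norm value by the square root of its own estimated variance is what compensates for the depth bias of the unweighted estimate.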
The cortical processing of auditory-alone, visual-alone, and audiovisual speech information is temporally and spatially distributed, and functional magnetic resonance imaging (fMRI) cannot adequately resolve its temporal dynamics. To investigate a hypothesized spatio-temporal organization for audiovisual speech processing circuits, event-related potentials (ERPs) were recorded using electroencephalography (EEG). Stimuli were congruent audiovisual /bα/, incongruent auditory /bα/ synchronized with visual /gα/, auditory-only /bα/, and visual-only /bα/ and /gα/. Current density reconstructions (CDRs) of the ERP data were computed across the latency interval of 50-250 milliseconds. The CDRs demonstrated complex spatio-temporal activation patterns that differed across stimulus conditions. The hypothesized circuit investigated here comprised initial integration of audiovisual speech by the middle superior temporal sulcus (STS), followed by recruitment of the intraparietal sulcus (IPS), followed by activation of Broca's area (Miller and d'Esposito, 2005). The importance of spatio-temporally sensitive measures in evaluating processing pathways was demonstrated. Results showed, strikingly, early (< 100 msec) and simultaneous activations in areas of the supramarginal and angular gyrus (SMG/AG), the IPS, the inferior frontal gyrus, and the dorsolateral prefrontal cortex. Also, emergent left hemisphere SMG/AG activation, not predicted based on the unisensory stimulus conditions, was observed at approximately 160 to 220 msec. The STS was neither the earliest nor the most prominent activation site, although it is frequently considered the sine qua non of audiovisual speech integration. As discussed here, the relatively late activity of the SMG/AG solely under audiovisual conditions is a possible candidate audiovisual speech integration response.
This paper argues that generalizations about prosodic phrasing are recursive in nature. Initial evidence comes from the fragment of English consisting only of proper names, 'and', and 'or'. A systematic relation between the semantics, the syntactic combinatorics, and the prosodic phrasing of these coordinate structures can be captured by recursively combining the prosodies (represented as relational metrical grids) of their parts, in tandem with assembling the compositional meaning of the expression. Alternative edge-based approaches to prosodic phrasing fail to capture the recursive nature of the generalization, a result independent of whether or not prosodic representation itself is assumed to be recursive. The presented model is argued to generalize beyond the coordinate fragment, despite two types of apparent counterexamples: structures that are prosodically flat but syntactically articulated, and structures with an apparent mismatch between prosody and syntax, as epitomized by the famous 'cat that caught the rat that stole the cheese' (Chomsky 1965, Chomsky & Halle 1968). Closer inspection reveals that the syntax might actually be quite in tune with prosody in both cases.
* This chapter is based on two chapters from my MIT dissertation (Wagner 2005b). In the meantime, I have received helpful feedback from Asaf Bachrach, John Bowers, Wayles Brown, Abby Cohn, Jon Gajewski, Mats Rooth, an anonymous reviewer for the Cornell working papers in Linguistics, and from audiences at the University of Connecticut, at McGill University, at Goethe-University in Frankfurt and the Psychology Department at Cornell.
1 I will use the terminology employed in Huddleston & Pullum (2001), who refer to 'and' and 'or' as the 'connectors' of coordinate structures and the parts they conjoin as the 'coordinates'.