Abstract: This paper introduces the new OLdenburg LOgatome speech corpus (OLLO) and outlines design considerations during its creation. OLLO is distinct from previous ASR corpora as it specifically targets (1) the fair comparison between human and machine speech recognition performance, and (2) the realistic representation of intrinsic variabilities in speech that are significant for automatic speech recognition (ASR) systems. To enable an unbiased human-machine comparison, OLLO is designed for recognition of individual…
“…to control stimulus delivery. The stimuli were the two syllables /ki/ and /ka/ and they were taken from the Oldenburg logatome speech corpus (OLLO; Wesker et al, 2005 ). The syllables were cut out of the available logatomes from one speaker (female speaker 1, V6 ‘normal spelling style’, no dialect).…”
“…We also used recordings taken from the OLLO database (Wesker et al, 2005), in German. The stimuli were already cut out.…”
Section: Discussion and Summary
“…Native speakers of English, French, Brazilian Portuguese, Turkish, Estonian and Bavarian-accented German recorded consonant-vowel-consonant (CVC) stimuli in carrier sentences. We also used recordings of six speakers taken from the OLLO database (Wesker et al, 2005), in standard German. Together, we use these CVC stimuli as the basis for items in both a discrimination and an assimilation experiment, described below.…”
Our native language influences the way we perceive speech sounds, affecting our ability to discriminate non-native sounds. We compare two ideas about the influence of the native language on speech perception: the Perceptual Assimilation Model, which appeals to a mental classification of sounds into native phoneme categories, versus the idea that rich, fine-grained phonetic representations tuned to the statistics of the native language are sufficient. We operationalize this idea using representations from two state-of-the-art speech models: a Dirichlet process Gaussian mixture model and the more recent wav2vec 2.0 model. We present a new, open dataset of French- and English-speaking participants' speech perception behaviour for 61 vowel sounds from six languages. We show that phoneme assimilation is a better predictor than fine-grained phonetic modelling, both for the discrimination behaviour as a whole and for predicting differences in discriminability associated with differences in native language background. We also show that wav2vec 2.0, while not good at capturing the effects of native language on speech perception, is complementary to information about native phoneme assimilation, and provides a good model of low-level phonetic representations, supporting the idea that both categorical and fine-grained perception are used during speech perception.
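The discrimination behaviour described above is typically scored with an ABX task: a trial counts as correct when the test token X lies closer, in some representation space, to the same-category token A than to the contrast token B. The following is a minimal sketch of that scoring logic with invented 2-D feature vectors — it is not the paper's actual pipeline, which uses DPGMM and wav2vec 2.0 representations and more refined distance measures:

```python
import math

def euclidean(u, v):
    # Euclidean distance between two equal-length feature vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def abx_correct(a, b, x):
    # ABX trial: X belongs to the same category as A; the trial is
    # "correct" if X lies closer to A than to B in feature space.
    return euclidean(x, a) < euclidean(x, b)

def abx_score(trials):
    # Fraction of ABX trials answered correctly (chance level = 0.5)
    return sum(abx_correct(a, b, x) for a, b, x in trials) / len(trials)

# Toy "representations": category A near (0, 0), category B near (3, 0)
trials = [
    ((0.0, 0.0), (3.0, 0.0), (0.2, 0.1)),   # X close to A -> correct
    ((0.1, 0.0), (3.1, 0.0), (0.3, -0.1)),  # X close to A -> correct
    ((0.0, 0.1), (2.9, 0.0), (2.8, 0.2)),   # X drifted toward B -> error
]
print(abx_score(trials))  # -> 0.666...
```

Representations that encode native-language category structure pull same-category tokens together, raising this score for native contrasts and lowering it for non-native ones.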
“…We designed stimuli that have formant patterns extracted from speech samples /gu/, /fu/, and /pu/ (Figure 1). Speech samples were extracted from the Oldenburg Logatome Corpus (OLLO) speech database [Wesker et al, 2005]. We chose VCV combination syllables with German speakers with no dialect.…”
Comodulated masking noise and binaural cues can facilitate detecting a target sound in noise. These cues can induce a decrease in detection thresholds, quantified as comodulation masking release (CMR) and binaural masking level difference (BMLD), respectively. However, their relevance to speech perception is unclear, as most studies have used artificial stimuli that differ from speech. Here, we investigated their ecological validity using sounds with speech-like spectro-temporal dynamics. We evaluated the ecological validity of this grouping effect with stimuli reflecting formant changes in speech. We set three masker bands at formant frequencies F1, F2, and F3 based on the CV combinations /gu/, /fu/, and /pu/. We found that the CMR was small (< 3 dB), while the BMLD was comparable to previous findings (∼ 9 dB). In conclusion, we suggest that other features, such as spectral proximity and the number of masker bands, may play a role in facilitating frequency grouping by comodulation.
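Both quantities in the abstract above are defined as drops in detection threshold, expressed in dB, so they are straightforward to compute once thresholds are measured. A minimal sketch (the threshold values below are invented for illustration, not taken from the study):

```python
def masking_release(reference_threshold_db, cue_threshold_db):
    # Masking release is the drop in detection threshold (in dB)
    # when a grouping or binaural cue is added; positive = benefit.
    return reference_threshold_db - cue_threshold_db

# Hypothetical detection thresholds (dB SPL) for a target in noise:
baseline = 62.0      # unmodulated, diotic masker (no cue)
comodulated = 59.5   # masker bands share a common envelope
dichotic = 53.0      # target phase differs between ears

cmr = masking_release(baseline, comodulated)   # -> 2.5 dB (small CMR)
bmld = masking_release(baseline, dichotic)     # -> 9.0 dB (typical BMLD)
print(f"CMR = {cmr:.1f} dB, BMLD = {bmld:.1f} dB")
```

The study's finding is that, with these speech-like formant-band maskers, the first difference stays under 3 dB while the second remains near the ∼9 dB reported for simpler stimuli.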