Transcribing children's speech with statistical models trained on adults' speech is very challenging. The large mismatch between the acoustic and linguistic attributes of the training and test data is reported to degrade recognition performance. In such speech recognition tasks, the difference in pitch (or fundamental frequency) between the two groups of speakers is one of several mismatch factors. To overcome the pitch mismatch, this work explores an existing pitch scaling technique based on iterative spectrogram inversion. Explicit pitch scaling is found to improve the recognition of children's speech under the mismatched setup. In addition, we study the effect of discarding the phase information during spectrum reconstruction, motivated by the fact that the dominant acoustic feature extraction techniques use only the magnitude spectrum. When evaluated under the mismatched testing scenario, the existing and the modified pitch scaling techniques yield very similar recognition performance. Furthermore, we explore the role of pitch scaling in another speech recognition system, one trained on speech data from both adult and child speakers. Pitch scaling is found to be effective for children's speech recognition in this case as well.
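As a rough illustration of what scaling pitch by a constant factor means, the following Python sketch lowers the fundamental frequency of a synthetic tone by plain linear-interpolation resampling. It is not the iterative spectrogram inversion technique explored in this work; resampling is swapped in only because it is short, and it also changes the signal's duration and spectral envelope. The function names `scale_pitch_by_resampling` and `estimate_f0`, the 8 kHz sampling rate, the 250 Hz test tone, and the scaling factor of 0.6 are all assumptions made for the example.

```python
import math


def scale_pitch_by_resampling(signal, factor):
    """Scale pitch by `factor` using linear-interpolation resampling.

    Hypothetical helper for illustration only. A factor below 1 lowers the
    pitch and lengthens the signal; a factor above 1 does the opposite.
    """
    n_out = int((len(signal) - 1) / factor)
    out = []
    for n in range(n_out):
        pos = n * factor              # fractional read position in the input
        i = int(pos)
        frac = pos - i
        out.append((1.0 - frac) * signal[i] + frac * signal[i + 1])
    return out


def estimate_f0(signal, sample_rate):
    """Estimate the fundamental frequency of a clean periodic signal.

    Uses the mean spacing between upward zero crossings. This is a toy
    estimator: it is adequate for a pure tone, not for real speech.
    """
    crossings = []
    for n in range(1, len(signal)):
        prev, cur = signal[n - 1], signal[n]
        if prev < 0.0 <= cur:
            # Linearly interpolate the crossing instant between two samples.
            crossings.append(n - 1 + (-prev) / (cur - prev))
    if len(crossings) < 2:
        raise ValueError("need at least two upward zero crossings")
    mean_period = (crossings[-1] - crossings[0]) / (len(crossings) - 1)
    return sample_rate / mean_period


# Assumed values: 8 kHz sampling, a 250 Hz tone standing in for a high-pitched
# voice, and a scaling factor of 0.6 to bring it down to roughly 150 Hz.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 250.0 * n / SAMPLE_RATE + 0.3) for n in range(1600)]
scaled = scale_pitch_by_resampling(tone, 0.6)

print("original f0 (Hz):", round(estimate_f0(tone, SAMPLE_RATE), 1))
print("scaled f0 (Hz):  ", round(estimate_f0(scaled, SAMPLE_RATE), 1))
```

In this toy setting, the estimated fundamental frequency of the output should be close to 0.6 times that of the input, and the output is longer than the input.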