In automatic speech recognition, little training data is often available for specific challenging tasks, yet training state-of-the-art systems requires large amounts of annotated speech. To address this issue, we propose a two-stage approach to acoustic modeling that combines noise and reverberation data augmentation with transfer learning to robustly handle challenges such as difficult acoustic recording conditions, spontaneous speech, and speech of elderly people. We evaluate our approach on German oral history interviews, where it achieves an average relative word error rate reduction of 19.3%.
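The noise and reverberation augmentation mentioned above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `add_reverb` and `mix_at_snr`, the use of plain Python lists, and the choice to loop the noise over the utterance are all assumptions made for illustration.

```python
import math

def add_reverb(speech, rir):
    """Convolve speech with a room impulse response, truncated to the input length."""
    out = [0.0] * len(speech)
    for i, s in enumerate(speech):
        for j, h in enumerate(rir):
            if i + j < len(out):
                out[i + j] += s * h
    return out

def mix_at_snr(speech, noise, snr_db):
    """Add noise scaled so that the speech-to-noise power ratio equals snr_db."""
    # Loop the noise so it covers the whole utterance (an assumption of this sketch).
    noise = [noise[i % len(noise)] for i in range(len(speech))]
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Choose the scale so that p_speech / (scale**2 * p_noise) == 10 ** (snr_db / 10).
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]
```

A typical use would be to apply `add_reverb` first and then `mix_at_snr`, so that each clean training utterance yields several degraded copies. A real system would more likely operate on arrays with an FFT-based convolution.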
The paper presents the aims and results of the project KA³ (Kölner Zentrum Analyse und Archivierung von audio-visual-Daten), in which advanced speech technologies are developed and provided to enhance the indexing and analysis of speech recordings from the oral history domain and the language sciences. Close cooperation between speech technology scientists and digital humanities researchers is an important aspect of the project, ensuring that the technologies developed meet the needs of research based on qualitative audio-visual interviews. For practical reasons, the project focuses on the audio aspect, although visual aspects are of course equally important for the analysis of audio-visual data. The Cologne Centre for Analysis and Archiving of audio-visual data will provide the technologies as a central service.
Neural networks have proven useful as a component of speech enhancement systems. This rests on their known ability to map regions of a feature space to other regions, which can be used to map noisy magnitude spectra to clean spectra. In this way, the network can replace adaptive filtering in the spectral domain. We set up such a system and compared its performance against a known adaptive filtering approach in terms of speech quality and recognition rate. It remains an open question how far speech quality can be enhanced by modifying not only the magnitude but also the spectral phase, and how such a phase modification could be realized. Before trying to use a neural network to modify the phase spectrum, we ran a set of oracle experiments to find out how far quality can be improved by modifying the magnitude and/or the phase spectrum in voiced segments. It turns out that simultaneously modifying the magnitude and phase spectrum has the potential to improve speech quality considerably compared with modifying the magnitude or the phase alone.
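The three oracle conditions described above (clean magnitude, clean phase, or both) can be sketched as follows. This is only an illustration under assumptions: the names `combine` and `oracle_variants` are hypothetical, the inputs are taken to be time-aligned complex spectral bins of one frame, and the restriction to voiced segments is not modeled.

```python
import cmath

def combine(magnitude_source, phase_source):
    """Build a spectrum from one signal's magnitudes and another signal's phases."""
    return [cmath.rect(abs(m), cmath.phase(p))
            for m, p in zip(magnitude_source, phase_source)]

def oracle_variants(clean_bins, noisy_bins):
    """Return the three oracle spectra for one frame of complex bins."""
    return {
        # Clean magnitude, noisy phase: the upper bound for magnitude-only enhancement.
        "magnitude_only": combine(clean_bins, noisy_bins),
        # Noisy magnitude, clean phase: the upper bound for phase-only enhancement.
        "phase_only": combine(noisy_bins, clean_bins),
        # Clean magnitude and clean phase: reproduces the clean spectrum.
        "magnitude_and_phase": combine(clean_bins, clean_bins),
    }
```

Each variant would then be transformed back to the time domain and scored with a speech quality measure, which this sketch leaves out.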
Automatic speech recognition systems have made remarkable improvements in transcription accuracy in recent years. In some domains, models now achieve near-human performance. However, transcription performance on oral history has not yet reached human accuracy. In the present work, we investigate how large this gap between human and machine transcription still is. For this purpose, we analyze and compare transcriptions by three humans on a new oral history data set. We estimate a human word error rate (WER) of 8.7 % for recent German oral history interviews with clean acoustic conditions. For comparison with recent machine transcription accuracy, we present experiments on the adaptation of an acoustic model that achieves near-human performance on broadcast speech. We investigate the influence of different adaptation data on robustness and generalization for clean and noisy oral history interviews. We improve our acoustic models by 5 to 8 % relative for this task and achieve a WER of 23.9 % on noisy and 15.6 % on clean oral history interviews.
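For readers unfamiliar with the metric, the word error rate and a relative improvement can be computed as in the sketch below. It uses the standard word-level edit distance; the function names and the whitespace tokenization are assumptions of this example, and the abstracts do not say which scoring tool or text normalization was used. The reference is assumed to be non-empty.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def relative_reduction(baseline_wer, improved_wer):
    """Relative WER reduction, e.g. 0.2 means the error rate dropped by 20 % relative."""
    return (baseline_wer - improved_wer) / baseline_wer
```

For example, `word_error_rate("das ist ein test", "das ist test")` returns `0.25`, because one of the four reference words was deleted.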