In this paper, we introduce two new confidence measures for large-vocabulary speech recognition systems. The major feature of these measures is that they can be computed without waiting for the end of the audio stream. We propose two kinds of confidence measures: frame-synchronous and local. The frame-synchronous measures can be computed as soon as a frame is processed by the recognition engine and are based on a likelihood ratio. The local measures estimate a local posterior probability in the vicinity of the word to analyze. We evaluated our confidence measures within the framework of the automatic transcription of French broadcast news, using the equal error rate (EER) criterion. Our local measures achieved results very close to the best state-of-the-art measure (EER of 23% compared to 22.0%). We then conducted a preliminary experiment to assess the contribution of our confidence measures to improving the comprehension of automatic transcriptions for the hearing impaired. We introduced several modalities for highlighting low-confidence words in the transcription and showed that these modalities, used with our local confidence measure, improved comprehension of the automatic transcription.
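As a concrete illustration (not the paper's actual implementation), a frame-synchronous likelihood-ratio confidence can be sketched as the difference between the per-frame log-likelihood of the decoded word's model and that of an unconstrained background model (e.g. a phone loop); all function names below are hypothetical:

```python
import numpy as np

def frame_confidence(frame_loglik, background_loglik):
    """Per-frame log-likelihood-ratio confidence (hypothetical sketch).

    frame_loglik: log-likelihood of each frame under the decoded word's model
    background_loglik: log-likelihood of the same frames under an
    unconstrained background model, used to normalize acoustic scores.
    """
    return np.asarray(frame_loglik) - np.asarray(background_loglik)

def word_confidence(frame_loglik, background_loglik):
    # Average the per-frame ratios over the word's span; the score is
    # available as soon as the word's last frame has been processed,
    # without waiting for the end of the audio stream.
    return float(np.mean(frame_confidence(frame_loglik, background_loglik)))
```

A higher ratio means the word model explains the frames much better than the background model, i.e. higher confidence.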
Bioacoustic event indexing must scale in space (oceans and large forests, multiple sensors) and in species number (thousands). We discuss why time-frequency featuring is inefficient compared to sparse coding (SC) for soundscape analysis. SC is based on the principle that an optimal code should contain enough information to reconstruct the input near regions of high data density, and should not contain enough information to reconstruct inputs in regions of low data density. It has been shown that SC methods can run in real time. We illustrate with an application to humpback whale songs, determining stable components versus evolving ones across seasons and years. By sparse coding at different time scales, the results show that the shortest humpback acoustic codes are the most stable (occurring with similar structure across two consecutive years). Another illustration is given on forest soundscape analysis, where we show that time-frequency atoms allow an easier analysis of forest sound organization, without initial classification of the events. This research is developed within the interdisciplinary CNRS project “Scale Acoustic Biodiversity,” with Univ. of Toulon, Paris Natural History Museum, and Paris 6, and consists of efficient processes for conditioning and representing relevant bioacoustic information, with examples at sabiod.univ-tln.fr.
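A minimal sketch of greedy sparse coding (matching pursuit over a dictionary with unit-norm columns) illustrates how a signal can be represented by a few atoms; this is a generic textbook illustration, not the project's actual codebase:

```python
import numpy as np

def matching_pursuit(x, D, n_atoms):
    """Greedy sparse coding sketch.

    x: input signal (1-D array)
    D: dictionary whose columns are unit-norm atoms
    n_atoms: number of greedy selection steps

    At each step, pick the atom most correlated with the residual and
    subtract its contribution, yielding a sparse code.
    """
    residual = x.astype(float).copy()
    code = np.zeros(D.shape[1])
    for _ in range(n_atoms):
        corr = D.T @ residual            # correlation of atoms with residual
        k = int(np.argmax(np.abs(corr))) # best-matching atom
        code[k] += corr[k]
        residual -= corr[k] * D[:, k]
    return code, residual
```

Signals in high-density regions are reconstructed well with few atoms; signals far from the dictionary's support leave a large residual, matching the density principle stated above.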
This paper is an analysis of adaptation techniques for French acoustic models (hidden Markov models). The LVCSR engine Julius, the Hidden Markov Model Toolkit (HTK), and the K-Fold cross-validation (CV) technique are used together to build three different adaptation methods: Maximum Likelihood a priori (ML), Maximum Likelihood Linear Regression (MLLR), and Maximum a Posteriori (MAP). Experimental results, in terms of word and phoneme error rates, indicate that the best adaptation method depends on the adaptation data, and that acoustic model performance can be improved by the use of phoneme-level alignments and K-Fold CV. The well-known K-Fold CV technique points to the best adaptation technique to follow for each type of data.
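The K-Fold CV selection step can be sketched as follows; `score_fn` is a hypothetical stand-in for adapting models on k-1 folds and scoring the error rate on the held-out fold (the names and interface here are assumptions, not the paper's tooling):

```python
import numpy as np

def kfold_indices(n, k):
    """Split n utterance indices into k interleaved folds (minimal sketch)."""
    idx = np.arange(n)
    return [idx[i::k] for i in range(k)]

def select_adaptation(methods, score_fn, n_utts, k=5):
    """Pick the adaptation method (e.g. 'ML', 'MLLR', 'MAP') with the
    lowest mean error rate across the k held-out folds.

    score_fn(method, test_idx) -> error rate on the held-out utterances,
    after adapting on the remaining folds (hypothetical callback).
    """
    folds = kfold_indices(n_utts, k)
    mean_err = {m: float(np.mean([score_fn(m, f) for f in folds]))
                for m in methods}
    return min(mean_err, key=mean_err.get)
```

Averaging over folds makes the choice less sensitive to any single adaptation subset, which is why the method selected can differ from one data type to another.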