Being able to recognize people from their voice is a natural ability that we take for granted. Recent advances have shown significant improvement in automatic speaker recognition performance. Besides being able to process large amounts of data in a fraction of the time required by a human, automatic systems are now able to deal with diverse channel effects. The goal of this paper is to examine how a state-of-the-art automatic system performs in comparison with human listeners, and to investigate a strategy for a human-assisted form of automatic speaker recognition, which is useful in forensic investigation. We set up an experimental protocol using data from the NIST SRE 2008 core set. A total of 36 listeners from three sites, namely Australia, Finland and Singapore, participated in the listening experiments. The state-of-the-art automatic system achieved a 20% error rate, whereas the fusion of human listeners achieved 22%.
Dynamic cepstral features such as delta and delta-delta cepstra have been shown to play an essential role in capturing the transitional characteristics of the speech signal. In this paper, a set of new dynamic features for speaker verification systems is introduced. These new features, known as Delta Cepstral Energy (DCE) and Delta-Delta Cepstral Energy (DDCE), can compactly represent the information in the delta and delta-delta cepstra. Further, using an entropy criterion, it is shown theoretically that DCE carries the same information as the delta cepstrum. Experimental speaker verification results on the TIMIT database support the theoretical result, showing a significant improvement in terms of equal error rate compared with conventional feature extraction methods using delta and delta-delta cepstra.
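As a rough illustration of the features this abstract builds on, the sketch below computes delta cepstra with the common regression formula over a window of neighbouring frames. The abstract does not define DCE, so `frame_energy` is only a hypothetical reading (the per-frame sum of squared delta coefficients), not the authors' formulation; the function names, the window width of 2, and the edge padding are likewise assumptions.

```python
import numpy as np


def delta(features, width=2):
    """Regression-based delta of a (frames, coefficients) array.

    d[t] = sum_{n=1..width} n * (c[t+n] - c[t-n]) / (2 * sum_{n} n^2)
    Edge frames are handled by repeating the first and last frame.
    """
    features = np.asarray(features, dtype=float)
    num_frames = features.shape[0]
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(features)
    for n in range(1, width + 1):
        ahead = padded[width + n : width + n + num_frames]
        behind = padded[width - n : width - n + num_frames]
        out += n * (ahead - behind)
    return out / denom


def frame_energy(deltas):
    """Hypothetical energy summary: sum of squared coefficients per frame.

    This is an assumption for illustration, not the DCE definition
    from the paper.
    """
    return np.sum(np.asarray(deltas, dtype=float) ** 2, axis=1)
```

Applying `delta` twice, as in `delta(delta(cepstra))`, gives delta-delta cepstra, and `frame_energy` then collapses each frame's coefficient vector to a single value, which is the sense in which an energy-style feature would be more compact than the full delta vector.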
This paper presents a segment selection technique for discarding portions of speech that result in poor discrimination ability in speaker verification tasks. Theory supporting the significance of a frame selection procedure for test segments, prior to making decisions, is also developed. This approach has the ability to reduce the effect of the acoustic regions of speech that are not accurately represented due to a lack of training data. Compared with a baseline system using both CMS and variance normalization, the proposed segment selection technique brings a 24% relative reduction in error rate over the entire testing data of the 2002 NIST Dataset in terms of minimum DCF. For short test segments, i.e., shorter than 15 seconds, the novel frame dropping technique produces a significant relative error rate reduction of 23% in terms of minimum DCF.
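The results above are reported as minimum DCF (detection cost function). A minimal sketch of how that metric is typically computed from trial scores follows. The abstract does not state the cost parameters, so the defaults here (C_miss = 10, C_fa = 1, P_target = 0.01) are an assumption based on the values commonly used in NIST speaker recognition evaluations of that period, and the value returned is the unnormalized cost.

```python
import numpy as np


def min_dcf(target_scores, nontarget_scores,
            c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Minimum detection cost over all decision thresholds.

    A trial is accepted when its score is >= the threshold.
    The cost parameters are assumed defaults, not taken from the paper.
    """
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    # Candidate thresholds: every observed score, plus +inf (reject all).
    all_scores = np.concatenate([target_scores, nontarget_scores])
    thresholds = np.append(np.unique(all_scores), np.inf)
    best = np.inf
    for th in thresholds:
        p_miss = np.mean(target_scores < th)     # true speakers rejected
        p_fa = np.mean(nontarget_scores >= th)   # impostors accepted
        cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
        best = min(best, cost)
    return float(best)


def relative_reduction(baseline, proposed):
    """Relative reduction of a cost or error rate with respect to a baseline."""
    return (baseline - proposed) / baseline
```

Under this definition, a reported 24% relative reduction means that `relative_reduction(baseline_min_dcf, proposed_min_dcf)` is about 0.24; the absolute costs are not given in the abstract.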