This paper proposes a novel front-end for automatic spoken language recognition, based on the spectrogram representation of the speech signal and on the ability of the Fourier spectrum to detect global periodicity in an image. The Local Phase Quantization (LPQ) texture descriptor is used to capture the spectrogram content. Results obtained for 30-second test signals show that this method is very promising for low-cost language identification. The best performance is achieved when the proposed method is fused with the i-vector representation.
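As a rough illustration of the descriptor involved, the sketch below computes an LPQ code histogram for a spectrogram treated as a grayscale image. It follows the common LPQ formulation (four low STFT frequencies, sign quantization of real and imaginary parts, 8-bit codes); the window size and frequency parameter are assumptions, not the authors' exact configuration:

```python
import numpy as np
from scipy.signal import fftconvolve

def lpq_histogram(img, win=7):
    """Minimal LPQ sketch: 8-bit codes from the signs of four local
    STFT coefficients, pooled into a 256-bin normalized histogram."""
    x = np.arange(win) - win // 2
    a = 1.0 / win                      # lowest non-zero frequency (assumed)
    w0 = np.ones(win)                  # DC along one axis
    w1 = np.exp(-2j * np.pi * a * x)   # complex exponential at frequency a
    # Separable 2-D filters for frequencies (a,0), (0,a), (a,a), (a,-a)
    freqs = [(w1, w0), (w0, w1), (w1, w1), (w1, np.conj(w1))]
    out_shape = (img.shape[0] - win + 1, img.shape[1] - win + 1)
    code = np.zeros(out_shape, dtype=int)
    for k, (wr, wc) in enumerate(freqs):
        resp = fftconvolve(img, np.outer(wr, wc), mode='valid')
        # Quantize the signs of real and imaginary parts into two bits
        code |= (resp.real > 0).astype(int) << (2 * k)
        code |= (resp.imag > 0).astype(int) << (2 * k + 1)
    hist = np.bincount(code.ravel(), minlength=256).astype(float)
    return hist / hist.sum()
```

The resulting 256-dimensional histogram is the fixed-length texture feature that a back-end classifier would consume.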
The analysis of the speech signal using wavelet packet trees (WPT) is a very flexible tool, capable of effectively manipulating frequency subbands thanks to the orthonormal bases it provides. Here, dimension reduction becomes very important, since the number of subbands grows exponentially with the decomposition level and the subbands differ in discriminative relevance, which leads to a different resolution for each one. A method based on mutual information is proposed in order to keep as much discriminative information as possible while retaining the least amount of redundant information.
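A selection criterion of this kind can be sketched with a greedy mRMR-style search: rank subband features by their mutual information with the class label and penalize mutual information with already-selected features. The histogram-based MI estimator and the additive relevance-minus-redundancy score below are illustrative assumptions, not the paper's exact criterion:

```python
import numpy as np

def mutual_info(x, y, bins=8):
    """Plug-in MI estimate (in nats) from a joint histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def select_subbands(features, labels, k):
    """Greedy selection: maximize relevance I(f; label), penalize the
    mean redundancy I(f; f_s) over the already-selected subbands."""
    n = features.shape[1]
    relevance = [mutual_info(features[:, j], labels) for j in range(n)]
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(n):
            if j in selected:
                continue
            red = np.mean([mutual_info(features[:, j], features[:, s])
                           for s in selected])
            if relevance[j] - red > best_score:
                best, best_score = j, relevance[j] - red
        selected.append(best)
    return selected
```

On subband energy features, this keeps discriminative subbands while dropping a subband whose content duplicates one already chosen.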
This paper presents the system employed by the Voice Group of CENATAV in the Albayzin 2018 "Search on Speech" Evaluation. The system used in the Spoken Term Detection (STD) task consists of an Automatic Speech Recognizer (ASR) and a module to detect the terms; the open-source Kaldi toolkit is used to build both modules. The ASR acoustic models are based on DNN-HMM, S-GMM or GMM-HMM, trained with audio data provided by the organizers and additional data obtained from ELDA. The lexicon and the trigram language model are obtained from the text associated with the audio. The ASR generates the lattices and the word alignments required to detect the terms. Results on development data show that the DNN-HMM model achieves behavior better than or similar to that obtained in previous challenges.
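The detection step can be pictured, in simplified form, as scanning a time-aligned word sequence for occurrences of each query term. The sketch below works only on a 1-best alignment and uses a minimum-confidence score, both simplifying assumptions; the actual system searches Kaldi lattices, which also cover competing hypotheses:

```python
def detect_terms(alignment, terms):
    """Find term occurrences in a 1-best word alignment.

    alignment: list of (word, start_time, duration, confidence) tuples.
    terms: list of tuples of words (multi-word terms allowed).
    Returns (term_string, start, end, score) hits; the score is the
    minimum word confidence over the matched span (an assumed choice).
    """
    hits = []
    words = [w for w, *_ in alignment]
    for term in terms:
        n = len(term)
        for i in range(len(words) - n + 1):
            if tuple(words[i:i + n]) == term:
                start = alignment[i][1]
                end = alignment[i + n - 1][1] + alignment[i + n - 1][2]
                score = min(a[3] for a in alignment[i:i + n])
                hits.append((" ".join(term), start, end, score))
    return hits
```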
Most common approaches to phonotactic language recognition use phone decoders as tokenizers. However, units that are not tied to phonetic definitions can be more universal, and therefore conceptually easier to adopt. It is assumed that the overall sound characteristics of all spoken languages can be covered by a broad collection of acoustic units, which can be characterized by acoustic segments. In this paper, such acoustic units, highly desirable for a more general language characterization, are delimited and clustered using a Gaussian Mixture Model. A new method for segmenting the speech into acoustic units is proposed for subsequent Gaussian modeling, aiming to replace the phonetic recognizer. This tokenizer is trained on untranscribed data, and it precedes the statistical language modeling phase.
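The idea of a GMM-based tokenizer can be sketched as follows: fit a GMM on untranscribed frame features, treat each mixture component as one acoustic unit, label frames with their most likely component, and collapse consecutive repeats into segments. The number of units and the diagonal covariances are illustrative assumptions, and the segmentation here is the simplest possible variant, not the paper's method:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_tokenizer(frames, n_units=16, seed=0):
    """Fit a GMM on untranscribed frame features; each mixture
    component plays the role of one acoustic unit (assumed setup)."""
    gmm = GaussianMixture(n_components=n_units,
                          covariance_type='diag', random_state=seed)
    gmm.fit(frames)
    return gmm

def tokenize(gmm, frames):
    """Label each frame with its most likely component, then collapse
    consecutive repeats into a sequence of acoustic-unit tokens."""
    labels = gmm.predict(frames)
    tokens = [int(labels[0])]
    for lab in labels[1:]:
        if lab != tokens[-1]:
            tokens.append(int(lab))
    return tokens
```

The resulting token sequences would then feed the statistical (e.g. n-gram) language modeling phase in place of phone strings.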