Feature extraction is a critical stage of digital speech processing systems. The quality of the features is of great importance because it provides the foundation on which the subsequent stages stand. Distinctive phonetic features (DPFs) are among the most representative features of speech signals. The significance of DPFs lies in their ability to provide an abstract description of the places and manners of articulation of a language's phonemes. A phoneme's DPF element reflects unique articulatory information about that phoneme. Therefore, there is a need to investigate each DPF element individually in order to achieve a deeper understanding and to develop a descriptive model for each one. Such fine-grained modeling respects the uniqueness of each DPF element. In this paper, the problem of DPF modeling and extraction for Modern Standard Arabic is tackled. Owing to the remarkable success of deep neural networks (DNNs) initialized using deep belief networks (DBNs) in DSP applications, and their capability of extracting highly representative features from raw data, we exploit their modeling power to investigate and model the DPF elements. DNN models are compared with classical multilayer perceptron (MLP) models. The representativeness of several acoustic cues for different DPF elements is also measured. This paper formalizes the DPF modeling problem as a binary classification problem. Because the DPF elements are highly imbalanced, evaluating the quality of the models is difficult. This paper addresses evaluation measures suited to the imbalanced nature of the DPF elements. After modeling each element individually, two top-level DPF extractors are designed: an MLP-based extractor and a DNN-based extractor.
The results show the quality of DNN models and their superiority over MLPs with accuracies of 89.0% and 86.7%, respectively.
INDEX TERMS Modern standard Arabic, distinctive phonetic features, speech processing, deep belief networks, restricted Boltzmann machine.
YASSER SEDDIQ received the B.S. degree in computer engineering from the King Fahd University of Petroleum and Minerals (KFUPM), Dhahran, Saudi Arabia, in 2004, and the M.S. degree in computer engineering and the Ph.D. degree in computer and information sciences (computer engineering) from King Saud University (KSU), Riyadh, Saudi Arabia, in 2010 and 2017, respectively. He is currently an Assistant Research Professor with the King Abdulaziz City for Science and Technology (KACST), Riyadh. His research interests include digital signal processing, speech processing, image processing, computer arithmetic, and digital systems design using FPGA.
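The abstract above treats each DPF element as an imbalanced binary classification problem but does not name the evaluation measures it adopts. As a minimal sketch, assuming common imbalance-aware choices (balanced accuracy, F1, and the Matthews correlation coefficient), the Python below illustrates why plain accuracy can mislead on such data. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = DPF element present)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def imbalance_aware_scores(y_true, y_pred):
    """Return accuracy next to measures that account for class imbalance.

    The choice of measures is an assumption made for illustration.
    """
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    balanced_accuracy = (recall + specificity) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "f1": f1,
        "mcc": mcc,
    }


# Toy data: 5 positive frames out of 100, and a model that always answers 0.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
print(imbalance_aware_scores(y_true, y_pred))
```

On this toy data the always-negative model reaches 95% accuracy, while balanced accuracy is 0.5 and both F1 and MCC are 0, which is the kind of gap that motivates imbalance-aware evaluation.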
Most research in the field of digital speech technology has traditionally been conducted in only a few languages, such as English, French, Spanish, or Chinese. Numerous studies using distinctive phonetic features (DPFs) with different techniques and algorithms have been carried out during the last three decades, mainly in English, Japanese, and other languages of industrialized countries. DPF elements are based on a technique used by linguists and digital speech and language experts to distinguish between different phones by considering the lowest level of actual features during phonation. These studies have investigated the best performances, outcomes, and theories, especially those regarding digital speech recognition. The aim of this paper is to present the background of DPF theories and their usefulness for digital speech and language processing. In addition, we highlight the background of Arabic language phonology compared to two well-known languages to enhance the current knowledge about this narrow language discipline. Finally, this work reviews the research dealing with DPF strategies for digital speech and language processing using computing and engineering techniques and theories. Based on the literature search conducted for this paper, we conclude that although Arabic is a very important and old Semitic language, it has so far suffered from a lack of modern research resources and theories on DPF elements.
The recording device, along with the acoustic environment, plays a major role in digital audio forensics. In this paper, we propose an acoustic source identification system that identifies both the recording device and the environment in which the recording was made. A hybrid convolutional neural network (CNN) with long short-term memory (LSTM) is used in this study to automatically extract environment and microphone features from the speech sound. In the experiments, we investigated the effect of using the voiced and unvoiced segments of speech on the accuracy of the environment and microphone classification. We also studied the effect of background noise on microphone classification in 3 different environments, i.e., very quiet, quiet, and noisy. The proposed system utilizes a subset of the KSU-DB corpus containing 3 environments, 4 classes of recording devices, 136 speakers (68 males and 68 females), and 3600 recordings of words, sentences, and continuous speech. This research combines the advantages of CNN and RNN (in particular, bidirectional LSTM) models in a single architecture, called CRNN. The speech signals were represented as spectrograms and fed to the CRNN model as 2D images. The proposed method achieved accuracies of 98% and 98.57% for environment and microphone classification, respectively, using unvoiced speech segments.
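The abstract above says the speech signals are represented as spectrograms and passed to the CRNN as 2D images, without giving the analysis settings. The following is a minimal sketch of that representation step only, assuming a Hann window, a 64-sample frame, and a 32-sample hop; these values and the function names are illustrative assumptions, and a naive DFT is used to keep the example dependency-free.

```python
import cmath
import math


def hann(n):
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def magnitude_spectrogram(signal, frame_len=64, hop=32):
    """Return a time-by-frequency grid of DFT magnitudes.

    Each row is one windowed frame and each column is one frequency bin,
    so the result can be treated as a single-channel 2D image.
    frame_len and hop are assumed values, not the paper's settings.
    """
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1  # non-negative frequencies only
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        row = []
        for k in range(n_bins):
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            row.append(abs(acc))
        rows.append(row)
    return rows


# Toy input: a pure tone that falls exactly on DFT bin 8.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
image = magnitude_spectrogram(tone)
print(len(image), "frames x", len(image[0]), "bins")
```

In practice an FFT routine from a numerical library would replace the naive DFT, and a log or mel scaling is often applied before the grid is given to a convolutional model; the abstract does not state whether the authors do so.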
Distinctive phonetic features play an important role in Arabic speech phoneme recognition. In a given language, distinctive phonetic features are extrapolated from acoustic features using different methods. However, exploiting a lengthy acoustic feature vector for the sake of phoneme recognition has a high cost in terms of computational complexity, which in turn affects real-time applications. The aim of this work is to consider methods that reduce the size of the feature vector employed for distinctive phonetic feature and phoneme recognition. The objective is to select the relevant input features that contribute to the speech recognition process. This, in turn, leads to reduced computational complexity of the recognition algorithm and improved recognition accuracy. In the proposed approach, a genetic algorithm is used to perform optimal feature selection. A baseline model based on feedforward neural networks is first built. This model is used to benchmark the results of the proposed feature selection method against a method that employs all elements of the feature vector. Experimental results, utilizing the King Abdulaziz City for Science and Technology Arabic Phonetic Database, show that the average overall phoneme recognition accuracy of the genetic algorithm based method is slightly higher than that of the recognition method employing the full-fledged feature vector. The genetic algorithm based distinctive phonetic feature recognition method achieved a 50% reduction in the dimension of the input vector while obtaining a recognition accuracy of 90%. Moreover, the results of the proposed method are validated using the Wilcoxon signed-rank test.
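The abstract above describes genetic-algorithm feature selection but gives no operator or parameter details. The sketch below shows one generic form of the technique, assuming binary masks over the feature vector, tournament selection, one-point crossover, bit-flip mutation, and elitism. All names and parameter values are hypothetical, and the toy fitness function merely stands in for the recognition accuracy of a trained network, which the paper would use instead.

```python
import random


def genetic_feature_selection(n_features, fitness, pop_size=20,
                              generations=30, mutation_rate=0.05, seed=0):
    """Search for a 0/1 feature mask that maximizes fitness(mask).

    Returns the best mask and the best fitness seen after each generation.
    """
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]

    def tournament():
        # Pick two individuals at random and keep the fitter one.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: carry the current best forward
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            cut = rng.randint(1, n_features - 1)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation toggles a feature on or off.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = children
        best = max(population, key=fitness)
        history.append(fitness(best))
    return best, history


# Hypothetical stand-in fitness: reward three "relevant" features and
# charge a small cost per selected feature to favor short vectors.
RELEVANT = {0, 2, 5}


def toy_fitness(mask):
    hits = sum(mask[i] for i in RELEVANT)
    return hits - 0.1 * sum(mask)


best_mask, best_history = genetic_feature_selection(10, toy_fitness)
print(best_mask, best_history[-1])
```

Because the best individual is copied into every new generation, the recorded best fitness never decreases. With a real classifier as the fitness function, each evaluation requires training or scoring a model, so population size and generation count dominate the running time.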