Effect of Dataset Size on Deep Learning in Voice Recognition

Çayır, Ayşe Nur; Navruz, Tuğba Selcen

doi:10.1109/hora52670.2021.9461395

Cited by 10 publications

(5 citation statements)

References 5 publications


“…The significance of the proposed voice augmentation technique was compared with the ordinary voice recognition augmentation and regularization techniques. Although the related works [30]-[31], [33]-[34] showed that the CNN model is an exemplary model for vocabulary-size speech recognition, we have proven that the fusion CNN-LSTM model is superior to the pure CNN and pure LSTM for two separate datasets. The LSTM model improved the inconsistent performance of the CNN model when CNN and LSTM were hybridized together.…”

Section: Introduction (mentioning)

confidence: 59%

“…Similarly, Wubet and Lian [32] showed that CNN is better than the SVM model for keyword recognition, and surprisingly, a hybrid of CNN-SVM outperformed pure CNN and pure SVM. Cayir and Navruz [33] investigated the influence of a limited size dataset for voice command recognition using 12 different voice commands ("down", "forward", "follow", "go", "left", "on", "off", "right", "stop", "up", and "yes"). Their experimental results showed that when the test dataset included native Turkish speakers, the test accuracy was 94.64% for a large dataset and 64.81% for a small dataset.…”

Section: Related Work (mentioning)

confidence: 99%


Voice Conversion Based Augmentation and a Hybrid CNN-LSTM Model for Improving Speaker-Independent Keyword Recognition on Limited Datasets

Wubet, Lian, 2022, IEEE Access


Keyword recognition is the basis of speech recognition, and its application is rapidly increasing in keyword spotting, robotics, and smart home surveillance. Because of these advanced applications, improving the accuracy of keyword recognition is crucial. In this paper, we proposed voice conversion (VC)-based augmentation to increase the limited training dataset and a fusion of a convolutional neural network (CNN) and long short-term memory (LSTM) model for robust speaker-independent isolated keyword recognition. Collecting and preparing a sufficient amount of voice data for speaker-independent speech recognition is a tedious and bulky task. To overcome this, we generated new raw voices from the original voices using an auxiliary classifier conditional variational autoencoder (ACVAE) method. In this study, the main intention of voice conversion is to obtain numerous and various human-like keyword voices that are not identical to the source and target speakers' pronunciation. Parallel VC was used to accurately maintain the linguistic content. We examined the performance of the proposed voice conversion augmentation techniques using robust deep neural network algorithms. Original training data, excluding generated voice using other data augmentation and regularization techniques, were considered as the baseline. The results showed that incorporating voice conversion augmentation into the baseline augmentation techniques and applying the CNN-LSTM model improved the accuracy of isolated keyword recognition.


“…Given these circumstances, an ASR model should be designed to generalize effectively and recognize a wide range of voices even with limited data. In this context, employing a dataset rich in accent diversity will enhance the generalization capabilities of the designed model (Cayir & Navruz, 2021). Therefore, our study aimed to utilize extensive datasets encompassing participants from various accent groups.…”

Section: Introduction (mentioning)

confidence: 99%

Customized deep learning based Turkish automatic speech recognition system supported by language model

Görmez, 2024, PeerJ Computer Science


Background: In today’s world, numerous applications integral to various facets of daily life include automatic speech recognition methods. Thus, the development of a successful automatic speech recognition system can significantly augment the convenience of people’s daily routines. While many automatic speech recognition systems have been established for widely spoken languages like English, there has been insufficient progress in developing such systems for less common languages such as Turkish. Moreover, due to its agglutinative structure, designing a speech recognition system for Turkish presents greater challenges compared to other language groups. Therefore, our study focused on proposing deep learning models for automatic speech recognition in Turkish, complemented by the integration of a language model.

Methods: In our study, deep learning models were formulated by incorporating convolutional neural networks, gated recurrent units, long short-term memories, and transformer layers. The Zemberek library was employed to craft the language model to improve system performance. Furthermore, the Bayesian optimization method was applied to fine-tune the hyper-parameters of the deep learning models. To evaluate the model’s performance, standard metrics widely used in automatic speech recognition systems, specifically word error rate and character error rate scores, were employed.

Results: Upon reviewing the experimental results, it becomes evident that when optimal hyper-parameters are applied to models developed with various layers, the scores are as follows. Without the use of a language model, the Turkish Microphone Speech Corpus dataset yields a word error rate of 22.2 and a character error rate of 14.05, while the Turkish Speech Corpus dataset results in a word error rate of 11.5 and a character error rate of 4.15. Upon incorporating the language model, notable improvements were observed. Specifically, for the Turkish Microphone Speech Corpus dataset, the word error rate decreased to 9.85 and the character error rate to 5.35. Similarly, for the Turkish Speech Corpus dataset, the word error rate improved to 8.4 and the character error rate decreased to 2.7. These results demonstrate that our model outperforms the studies found in the existing literature.
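For readers unfamiliar with the two metrics named in this abstract, the following is a minimal sketch of how word error rate and character error rate are commonly computed: the edit (Levenshtein) distance between a reference transcript and a recognizer's output, divided by the reference length. The abstract does not state the exact normalization or text preprocessing used, so the function names and the whitespace tokenization here are illustrative assumptions, not the cited study's implementation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions, and
    deletions needed to turn `ref` into `hyp`.
    """
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Edit distance over whitespace-separated words / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def character_error_rate(reference, hypothesis):
    """Edit distance over characters / reference character count."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    # One dropped word out of four reference words.
    print(word_error_rate("turn the light on", "turn light on"))
    # One substituted character out of four reference characters.
    print(character_error_rate("stop", "step"))
```

Both functions return a fraction; multiplying by 100 gives a percentage. Because the numerator counts insertions as well, either rate can exceed 1.0 when the hypothesis is much longer than the reference.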


“…However, existing deep learning methods still have limitations in side-channel analysis. Deep learning has been well-established for tasks such as image processing [13,14] and speech recognition [15,16], but its application in cryptographic algorithms is relatively limited. Classical handwritten digit classification involves 10 classes, while the analysis of cryptographic keys requires exponentially more classification categories, making existing classification models less suitable.…”

Section: Introduction (mentioning)

confidence: 99%

A Secret Key Classification Framework of Symmetric Encryption Algorithm Based on Deep Transfer Learning

Cui, Zhang, Fang, et al., 2023, Applied Sciences


The leakage signals, including electromagnetic, energy, time, and temperature, generated during the operation of password devices contain highly correlated key information, which leads to security vulnerabilities. In traditional encryption algorithms, the length of the key greatly affects the upper limit of its security against cracking. Regarding side-channel attacks on long-key algorithms, traditional template attack methods characterize the energy traces using multivariate Gaussian distribution during the template construction phase. The exhaustive key-guessing process is expected to consume a significant amount of time and computational resources. Therefore, to analyze the effectiveness of obtaining key values from the side information of password devices, we propose an innovative attack method based on a divide-and-conquer logical structure, targeting semi-bytes. We construct a collection of key classification submodules with symmetric correlations. By integrating a differential network model for byte-block sets and an end-to-end direct attack method, we form a holistic symmetric decision framework and propose a key classification structure based on deep transfer learning. This structure consists of three main parts: side information data acquisition, analysis of key-value effectiveness, and determination of attack positions. It employs multiple parallel symmetric subnetworks, effectively improving attack efficiency and reducing the key enumeration range. Experimental results show that the optimal attack accuracy of the network model can reach 91%, with an average attack accuracy of 78%. It overcomes overfitting issues under small sample dataset conditions.
