Identifying the influence of transfer learning method in developing an end-to-end automatic speech recognition system with a low data level

Mamyrbayev, Orken; Alimhan, Keylan; Oralbekova, Dina; Bekarystankyzy, Akbayan; Zhumazhanov, Bagashar

doi:10.15587/1729-4061.2022.252801

Cited by 14 publications

(9 citation statements)

References 28 publications

(34 reference statements)


“…Suppose that for image classification, a CNN with architecture S and parameters P is presynthesized, CNN={S, P} [14,15]. This network has already learned earlier to extract features for solving the problem of image classification.…”

Section: Formulation Of the Problem (mentioning)

confidence: 99%

Xception transfer learning with early stopping for facial age estimation

Polyakova, Rogachko, Nesteriuk et al. 2024, AAIT


The rapid development of deep learning attracts more attention to the analysis of a person's face images. Deep learning methods of facial age estimation are more effective than methods based on anthropometric models, active appearance models, texture models, or aging pattern subspaces. However, deep learning networks require more computing power to process images. Pre-trained models do not need a large training set and their training time is shorter. However, the parameters obtained as a result of transfer learning of the pre-trained network significantly affect its efficiency. It is also necessary to take into account the properties of the processed images, in particular, the conditions under which they were obtained. Recently, facial age estimation has been implemented in applications on devices with limited computing resources, for example, smartphones. The memory size and power consumption of such applications are limited by the computing power of mobile devices. In addition, when photographing a person's face with a smartphone camera, it is very difficult to ensure uniform lighting. The aim of the research is to reduce the error of facial age estimation from unevenly illuminated images by applying early stopping to transfer learning of the Xception network. The proposed transfer learning technique stops training early if no improvement in the results is observed within a certain number of epochs. The network weights from the epoch with the lowest validation loss are then saved. As a result of applying the proposed technique, the average absolute error of age estimation was about five years on unevenly illuminated test images. The Xception network used here has fewer parameters than other deep neural networks that have been applied to the age estimation problem, so using it reduces the resource consumption of devices with limited computing power. Prospects for further research include reducing the unevenness of facial image lighting to decrease the age estimation error. Also, to reduce computing resources, it is promising to use fast transforms in the Xception convolutional layers.
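The early-stopping rule described in this abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `EarlyStopping` class, the `run_training` helper, the `patience` parameter name, and the dictionary that stands in for network weights are all assumptions introduced here.

```python
import copy


class EarlyStopping:
    """Hypothetical helper: stop when validation loss stalls, keep best weights."""

    def __init__(self, patience: int):
        self.patience = patience              # epochs allowed without improvement
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.best_weights = None
        self.stale_epochs = 0

    def step(self, epoch: int, val_loss: float, weights) -> bool:
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            # New lowest validation loss: snapshot the weights of this epoch.
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.best_weights = copy.deepcopy(weights)
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


def run_training(val_losses, patience):
    """Replay a list of per-epoch validation losses (stand-in for real training)."""
    stopper = EarlyStopping(patience)
    last_epoch = -1
    for epoch, loss in enumerate(val_losses):
        last_epoch = epoch
        weights = {"epoch": epoch}            # placeholder for real network weights
        if stopper.step(epoch, loss, weights):
            break
    return stopper.best_epoch, last_epoch, stopper.best_weights
```

In a real training loop, `weights` would be the model's parameters after each epoch, and the saved `best_weights` would be restored once training stops.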


“…Besides it has been proposed to replace UBM and i-vector classifier by deep neural network (DNN) taking into account the experience of deep learning for speech recognition [9,10]. The DNN-based d-vector framework assigns the ground-truth speaker identity of a training speech signal as the labels of the training frames of this signal.…”

Section: Analysis Of Recent Research and Publications (mentioning)

confidence: 99%

“…Raw wave neural networks take raw waves in the time domain as the inputs to extract learnable acoustic features [10,16]. It has been observed that with the use of CNN the filters of the first convolution layer capture the speaker information in low frequency regions.…”

Section: Analysis Of Recent Research and Publications (mentioning)

confidence: 99%

The structural tuning of the convolutional neural network for speaker identification in mel frequency cepstrum coefficients space

Matychenko, Polyakova 2023, HAIT


As a result of the literature analysis, the main methods for speaker identification from speech signals were defined. These are statistical methods based on a Gaussian mixture model and a universal background model, as well as neural network methods, in particular, those using convolutional or Siamese neural networks. The main characteristics of these methods are the recognition performance, the number of parameters, and the training time. High recognition performance is achieved by using convolutional neural networks, but the number of parameters of these networks is much higher than for statistical methods, although lower than for Siamese neural networks. A large number of parameters requires a large training set, which is not always available to the researcher. In addition, despite the effectiveness of convolutional neural networks, model size and inference efficiency remain important for devices with a limited source of computing power, such as peripheral or mobile devices. Therefore, tuning the structure of existing convolutional neural networks is relevant for research. In this work, we have performed a structural tuning of an existing convolutional neural network based on the VGGNet architecture for speaker identification in the space of mel frequency cepstrum coefficients. The aim of the work was to reduce the number of neural network parameters and, as a result, to reduce the network training time, provided that the recognition performance is sufficient (correct recognition above 95 %). The neural network proposed as a result of structural tuning has fewer layers than the architecture of the basic neural network. Instead of the ReLU activation function, the related Leaky ReLU function with a parameter of 0.1 was used. The number of filters and the size of kernels in convolutional layers were changed. The size of kernels for the max pooling layer was increased. It is proposed to average the results of each convolution so that the two-dimensional convolution results can be fed to a fully connected layer with the Softmax activation function. The performed experiment showed that the number of parameters of the proposed neural network is 29 % less than the number of parameters of the basic neural network, provided that the speaker recognition performance is almost the same. In addition, the training time of the proposed and basic neural networks was evaluated on five datasets of audio recordings corresponding to different numbers of speakers. The training time of the proposed network was reduced by 10-39 % compared to the basic neural network. The results of the research show the advisability of the structural tuning of the convolutional neural network for devices with a limited source of computing, namely, peripheral or mobile devices.
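Two of the building blocks named in this abstract, Leaky ReLU with a parameter of 0.1 and averaging each convolution result before a Softmax layer, can be illustrated with a short sketch. It assumes that "averaging the results of each convolution" means taking the mean of each two-dimensional feature map (commonly called global average pooling); the function names and the plain nested-list representation are assumptions made here, and the fully connected weights that would sit between pooling and Softmax are omitted.

```python
import math


def leaky_relu(x: float, alpha: float = 0.1) -> float:
    """Leaky ReLU: identity for x >= 0, slope `alpha` for negative inputs."""
    return x if x >= 0 else alpha * x


def global_average_pool(feature_maps):
    """Reduce each 2-D feature map (a list of rows) to its mean value."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Two hypothetical 2x2 feature maps produced by a convolutional layer.
maps = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]
pooled = global_average_pool(maps)      # one value per feature map
probabilities = softmax(pooled)         # scores turned into probabilities
```

Averaging turns each feature map into a single number, so the input to the final layer grows with the number of filters rather than with the spatial size of the maps, which is one common way to keep the parameter count down.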


“…Kazakh language is a low-resource language from the Turkic family of languages, which belong to agglutinative languages. Almost all of these languages suffer from data shortages. Namely, they suffer from a lack of transcribed audio data.…”

Section: Introduction (mentioning)

confidence: 99%

Multilingual end-to-end ASR for low-resource Turkic languages with common alphabets

Bekarystankyzy, Mamyrbayev, Mendes et al. 2024, Sci Rep


To obtain a reliable and accurate automatic speech recognition (ASR) machine learning model, it is necessary to have sufficient transcribed audio data for training. Many languages in the world, especially the agglutinative languages of the Turkic family, suffer from a lack of this type of data. Many studies have been conducted in order to obtain better models for low-resource languages, using different approaches. The most popular approaches include multilingual training and transfer learning. In this study, we combined five agglutinative languages from the Turkic family (Kazakh, Bashkir, Kyrgyz, Sakha, and Tatar) for multilingual training using connectionist temporal classification and an attention mechanism including a language model, because these languages share cognate words, sentence formation rules, and an alphabet (Cyrillic). Data from the open-source database Common Voice was used for the study, to make the experiments reproducible. The results of the experiments showed that multilingual training could improve ASR performance for all languages included in the experiment, except the Bashkir language. A dramatic result was achieved for the Kyrgyz language: word error rate decreased to nearly one-fifth and character error rate decreased to one-fourth, which proves that this approach can be helpful for critically low-resource languages.
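For readers unfamiliar with the two metrics reported here, word error rate and character error rate are conventionally computed as an edit distance divided by the reference length. The sketch below follows that common definition; it is not taken from the cited paper, and the function names `edit_distance`, `wer`, and `cer` are assumptions chosen for illustration.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this definition, a word error rate that falls to one-fifth of its earlier value means the recognizer needs about five times fewer word edits per reference word to match the transcript.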
