Multi-resolution speech analysis for automatic speech recognition using deep neural networks: Experiments on TIMIT

Toledano, Doroteo T.; Fernández-Gallego, María Pilar; Lozano-Díez, Alicia

doi:10.1371/journal.pone.0205355

Cited by 29 publications

(22 citation statements)

References 20 publications


“…Likewise, we observe that the proposed YOLOv3 model is on a par with most of state-of-the-art models such as [50,51,100] and outperforms many of cutting edge models such as [1], [48], [52], [53].…”

Section: Comparison Of Proposed IaTS With State-of-the-art Methods (mentioning)

confidence: 66%

“…A DBN with multiple hidden layers was also proposed by the same authors in [47] and achieved a 20.7% PER on TIMIT. Recently, a DNN acoustic model for TIMIT phone recognition based on multi resolution speech representation proposed in [48] achieved the best PER of 18.25%. The performances of a feed forward DNN, time delay neural network (TDNN), and long short-term memory (LSTM) are explored in [44] for TIMIT phone recognition, where LSTM-based phone recognition achieved a PER of 15.02%.…”

Section: English DNN-based ASR (mentioning)

confidence: 99%


Towards Deep Object Detection Techniques for Phoneme Recognition

et al. 2020


The use of cutting edge object detection techniques to build an accurate phoneme sequence recognition system for English and Arabic languages is investigated in this study. Recently, numerous techniques have been proposed for object detection in daily life applications using deep learning. In this paper, we propose the use of object detection techniques in speech processing tasks. We selected two state-of-the-art object detectors, namely YOLO and CenterNet, based on a trade-off between detection accuracy and speed. We tackled the problem of phoneme sequence recognition using three systems: the domain transfer learning system (DTS) from image to speech, intra-language transfer learning system (IaTS) between speech corpora within the same language (English to English), and inter-language transfer learning system (IeTS) between speech corpora from dissimilar languages (English to Arabic). For English phoneme recognition, the Texas Instruments/Massachusetts Institute of Technology (TIMIT) corpus is used to evaluate the performance of the proposed systems. Our IaTS based on the CenterNet detector achieves the best results using the test core set of TIMIT with 15.89% phone error rate (PER). For Arabic phoneme recognition, the best performance, with 7.58% PER, was achieved using the CenterNet. These results show the effectiveness of using object detection techniques in phoneme recognition tasks. Furthermore, based on the findings of this study, speech processing tasks may be treated as object detection tasks. Index terms: CenterNet, object detection, phoneme recognition, transfer learning, YOLO.


“…The remarkable rise of deep learning (DL) relying on the robust function approximations and representation properties of deep neural networks has provided us with new tools to automatically find compact low-dimensional representations (features) of high-dimensional data (LeCun et al, 2015). DL models have achieved outstanding predictive performance making dramatic breakthroughs in a wide range of applications, including automatic speech processing and image recognition (Toledano et al, 2018; Kim et al, 2019; Hey et al, 2020; Wu et al, 2020). In the words of Geoffrey Hinton who is the founder of DL technologies “Deep Learning is an algorithm which has no theoretical limitations on what it can learn; the more data you give and the more computational time you provide the better it is” (LeCun et al, 2015).…”

Section: The Rise Of the Machines: Allosteric Mechanisms Through The (mentioning)

confidence: 99%

Allosteric Regulation at the Crossroads of New Technologies: Multiscale Modeling, Networks, and Machine Learning

Verkhivker, Agajanian, et al. 2020

Front. Mol. Biosci.


“…iii) Classification is the process of mapping the feature vector of an input word into 1 out of N word classes of the considered vocabulary during testing. Some popularly used classifiers in ASR are Artificial Neural Network (ANN) [5], [10], [12], [13], Hidden Markov model (HMM) [14], [15], Dynamic Time Warping (DTW) [16], [17], Deep Neural Network [9], [47], [51], etc. The application of ANN in designing ASR system is still being used by researchers [5], [6], [19], [20], [21], [22], [23], [36], [40], [42] despite the developments in the field of deep neural network (DNN) in recent times.…”

Section: Introduction (mentioning)

confidence: 99%

“…In recording the speech utterances, the following hardware and software These speakers do not have any history of speech disorders. As there is no specific rule about the male-female proportion in construction of speech database, literatures [51], [56], [58] have considered various proportions like 60%-40%, 70%-30%, 65%-35%, etc. The speakers in this work are chosen from Sylheti speaking areas in the Karimganj district of the state of Assam and the Kailasahar and Kumarghat districts of the state of Tripura, India where they have been living since their childhood.…”

mentioning

confidence: 99%

Speech Recognition of Isolated Words using a New Speech Database in Sylheti

Chakraborty, Saikia 2019

IJRTE


With the advancements in the field of artificial intelligence, speech recognition based applications are becoming more and more popular in the recent years. Researchers working in many areas including linguistics, engineering, psychology, etc. have been trying to address various aspects relating to speech recognition in different natural languages around the globe. Although many interactive speech applications in "well-resourced" major languages are being developed, uses of these applications are still limited due to language barrier. Hence, researchers have also been concentrating to design speech recognition system in various under-resourced languages. Sylheti is one of such under-resourced languages primarily spoken in the Sylhet division of Bangladesh and also spoken in the southern part of Assam, India. This paper has two contributions: i) it presents a new speech database of isolated words for the Sylheti language, and ii) it presents speech recognition systems for the Sylheti language to recognize isolated Sylheti words by applying two variants of neural network classifiers. The performances of these recognition systems are evaluated with the proposed database and the observations are presented.
