2018
DOI: 10.1371/journal.pone.0205355

Multi-resolution speech analysis for automatic speech recognition using deep neural networks: Experiments on TIMIT

Abstract: Speech Analysis for Automatic Speech Recognition (ASR) systems typically starts with a Short-Time Fourier Transform (STFT) that implies selecting a fixed point in the time-frequency resolution trade-off. This approach, combined with a Mel-frequency scaled filterbank and a Discrete Cosine Transform, gives rise to the Mel-Frequency Cepstral Coefficients (MFCC), which have been the most common speech features in speech processing for the last decades. These features were particularly well suited for the previous Hi…
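
The fixed time-frequency trade-off mentioned in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the helper name `stft_resolution` and the three window lengths are assumptions chosen for illustration, and the 16 kHz rate is the sampling rate of the TIMIT recordings. For a window of N samples, the window spans N / fs seconds and the STFT bins are spaced fs / N Hz apart, so improving one resolution degrades the other.

```python
def stft_resolution(sample_rate_hz, window_samples):
    """Return (window duration in seconds, frequency bin spacing in Hz).

    Hypothetical helper for illustration; not part of the cited paper.
    """
    duration_s = window_samples / sample_rate_hz
    bin_spacing_hz = sample_rate_hz / window_samples
    return duration_s, bin_spacing_hz


SAMPLE_RATE_HZ = 16000  # TIMIT audio is sampled at 16 kHz

# Window lengths are illustrative assumptions: short, medium, long.
for window_samples in (128, 400, 1024):
    duration_s, bin_spacing_hz = stft_resolution(SAMPLE_RATE_HZ, window_samples)
    print(
        f"{window_samples} samples: "
        f"{duration_s * 1000:.1f} ms window, "
        f"{bin_spacing_hz:.2f} Hz between bins"
    )
```

The product of the two returned values is always 1, which is why a single STFT window fixes one point on the trade-off and why analysing the signal at several window lengths gives complementary views.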

citations
Cited by 29 publications
(22 citation statements)
references
References 20 publications
“…Likewise, we observe that the proposed YOLOv3 model is on a par with most of state-of-the-art models such as [50,51,100] and outperforms many of cutting-edge models such as [1], [48], [52], [53].…”
Section: Comparison Of Proposed Iats With State-of-the-art Methods
mentioning
confidence: 66%
“…A DBN with multiple hidden layers was also proposed by the same authors in [47] and achieved a 20.7% PER on TIMIT. Recently, a DNN acoustic model for TIMIT phone recognition based on multi-resolution speech representation proposed in [48] achieved the best PER of 18.25%. The performances of a feed forward DNN, time delay neural network (TDNN), and long short-term memory (LSTM) are explored in [44] for TIMIT phone recognition, where LSTM-based phone recognition achieved a PER of 15.02%.…”
Section: English DNN-based ASR
mentioning
confidence: 99%
“…The remarkable rise of deep learning (DL) relying on the robust function approximations and representation properties of deep neural networks has provided us with new tools to automatically find compact low-dimensional representations (features) of high-dimensional data (LeCun et al, 2015). DL models have achieved outstanding predictive performance making dramatic breakthroughs in a wide range of applications, including automatic speech processing and image recognition (Toledano et al, 2018; Kim et al, 2019; Hey et al, 2020; Wu et al, 2020). In the words of Geoffrey Hinton who is the founder of DL technologies “Deep Learning is an algorithm which has no theoretical limitations on what it can learn; the more data you give and the more computational time you provide the better it is” (LeCun et al, 2015).…”
Section: The Rise Of the Machines: Allosteric Mechanisms Through The
mentioning
confidence: 99%
“…iii) Classification is the process of mapping the feature vector of an input word into 1 out of N word classes of the considered vocabulary during testing. Some popularly used classifiers in ASR are Artificial Neural Network (ANN) [5], [10], [12], [13], Hidden Markov model (HMM) [14], [15], Dynamic Time Warping (DTW) [16], [17], Deep Neural Network [9], [47], [51], etc. The application of ANN in designing ASR system is still being used by researchers [5], [6], [19], [20], [21], [22], [23], [36], [40], [42] despite the developments in the field of deep neural network (DNN) in recent times.…”
Section: Introduction
mentioning
confidence: 99%
“…In recording the speech utterances, the following hardware and software These speakers do not have any history of speech disorders. As there is no specific rule about the male-female proportion in construction of speech database, literatures [51], [56], [58] have considered various proportions like 60%-40%, 70%-30%, 65%-35%, etc. The speakers in this work are chosen from Sylheti speaking areas in the Karimganj district of the state of Assam and the Kailasahar and Kumarghat districts of the state of Tripura, India where they have been living since their childhood.…”
mentioning
confidence: 99%