2019
DOI: 10.3390/sym11050644
End-to-End Mandarin Speech Recognition Combining CNN and BLSTM

Abstract: Since conventional Automatic Speech Recognition (ASR) systems often contain many modules and require a variety of expertise, such models are hard to build and train. Recent research shows that end-to-end ASR systems can significantly simplify the speech recognition pipeline and achieve performance competitive with conventional systems. However, most end-to-end ASR systems are neither reproducible nor comparable because they use specific language models and in-house training databases which are not freely available. Thi…

Cited by 31 publications (16 citation statements). References 20 publications.
“…The second approach is end-to-end speech recognition. It differs from sequential hierarchical analysis in that it allows you to analyze the original signal and move to higher levels of analysis (for example, the level of words), bypassing lower levels [17,18].…”
Section: Methods Of Syllable Recognition
confidence: 99%
“…In order to predict the missing sequence of GPS points in between the trajectory, it is necessary to know the information from both previous and future timesteps. For this purpose, we make use of a bidirectional encoder so that the movement patterns in both directions can be captured [47]. Due to the spatiotemporal nature of GPS trajectory data, we make use of the ConvLSTM architecture [43] in our model, which is able to deal with both spatial and temporal dependencies in the data.…”
Section: Approach
confidence: 99%
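The bidirectional ConvLSTM encoder this citing work refers to is not detailed here, but the underlying idea is easy to illustrate. Below is a minimal PyTorch sketch, assuming rasterised trajectory frames of shape (batch, time, channels, H, W); the class names `ConvLSTMCell` and `BiConvLSTMEncoder` and all sizes are hypothetical placeholders, not the cited paper's implementation.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Single ConvLSTM cell: all four gates are computed with one convolution
    over the concatenated input frame and hidden state."""
    def __init__(self, in_ch, hid_ch, kernel=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel, padding=kernel // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g                 # update the cell memory
        h = o * torch.tanh(c)             # expose the new hidden state
        return h, c

class BiConvLSTMEncoder(nn.Module):
    """Runs one ConvLSTM pass forward and one backward over the time axis and
    concatenates the hidden states, so every timestep sees past and future context."""
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.fwd = ConvLSTMCell(in_ch, hid_ch)
        self.bwd = ConvLSTMCell(in_ch, hid_ch)

    def _run(self, cell, seq):
        b, t, _, hgt, wid = seq.shape
        h = seq.new_zeros(b, cell.hid_ch, hgt, wid)
        c = torch.zeros_like(h)
        outs = []
        for step in range(t):
            h, c = cell(seq[:, step], (h, c))
            outs.append(h)
        return torch.stack(outs, dim=1)

    def forward(self, seq):                              # (batch, time, C, H, W)
        fwd = self._run(self.fwd, seq)
        bwd = self._run(self.bwd, seq.flip(1)).flip(1)   # reverse time, then restore order
        return torch.cat([fwd, bwd], dim=2)              # (batch, time, 2*hid_ch, H, W)

# Toy usage: 8-step sequence of 1-channel 16x16 "rasterised" GPS frames.
enc = BiConvLSTMEncoder(in_ch=1, hid_ch=8)
out = enc(torch.randn(2, 8, 1, 16, 16))
print(out.shape)   # torch.Size([2, 8, 16, 16, 16])
```

Running the forward and backward passes separately and concatenating their hidden states is what lets each timestep condition on both earlier and later GPS points, which is the property the quoted passage relies on for imputing missing points.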
“…Sainath et al. (2015) combined CNNs, LSTMs, and DNNs into a unified deep learning system, named CLDNN, to take advantage of each architecture, and achieved better performance than the LSTM, which is considered the strongest of these three alternatives in speech recognition. Wang et al. (2019) proposed a CNN-BLSTM-CTC deep learning hybrid model for Mandarin speech recognition. They employed a CNN to learn local speech features, a BLSTM to learn past and future dependencies, and CTC for decoding, and claimed that their proposed method outperformed the best existing model.…”
Section: Hybrid Approaches For Speech Recognition
confidence: 99%
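The CNN-BLSTM-CTC pipeline summarised in the quoted passage can be sketched in a few lines of PyTorch. The sketch below only illustrates the general architecture, not Wang et al.'s exact configuration: the class name `CnnBlstmCtc`, the layer sizes, and the label count of 4233 are placeholders chosen for the example.

```python
import torch
import torch.nn as nn

class CnnBlstmCtc(nn.Module):
    """CNN front end over the spectrogram, a bidirectional LSTM over the
    resulting frame sequence, and a linear projection to per-frame label
    posteriors trained with the CTC loss."""
    def __init__(self, n_mels=80, n_labels=4233, hidden=256):
        super().__init__()
        # 2-D convolutions learn local time-frequency patterns; each layer
        # halves the frequency axis (stride 2 on that dimension only).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=(1, 2), padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=(1, 2), padding=1), nn.ReLU(),
        )
        feat_dim = 32 * (n_mels // 4)
        # The BLSTM captures dependencies on both past and future frames.
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=3,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_labels + 1)   # +1 for the CTC blank

    def forward(self, feats):                  # feats: (batch, time, n_mels)
        x = self.cnn(feats.unsqueeze(1))       # (batch, 32, time, n_mels // 4)
        x = x.permute(0, 2, 1, 3).flatten(2)   # (batch, time, feat_dim)
        x, _ = self.blstm(x)
        return self.proj(x).log_softmax(-1)    # log-probs for CTC decoding

# Dummy training step: 4 utterances of 200 frames, 20 target tokens each.
model = CnnBlstmCtc()
ctc = nn.CTCLoss(blank=model.proj.out_features - 1)
feats = torch.randn(4, 200, 80)
log_probs = model(feats).transpose(0, 1)       # CTC expects (time, batch, labels)
targets = torch.randint(0, 4233, (4, 20))      # placeholder character/syllable ids
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 200),
           target_lengths=torch.full((4,), 20))
loss.backward()
```

Because the CTC loss aligns unsegmented transcripts to the frame-level outputs on its own, a model like this can be trained directly from audio and text, which is what lets end-to-end systems drop the separate alignment and pronunciation-modelling modules of a conventional ASR pipeline.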