2021
DOI: 10.1109/taslp.2021.3120632

Keyword Search Using Attention-Based End-to-End ASR and Frame-Synchronous Phoneme Alignments

Cited by 12 publications (4 citation statements)
References 41 publications

“…During the inference stage, we retrieve keywords within the ASR 2-best hypotheses. During KWS scoring, a predicted keyword occurrence is considered correct when there is at least a 50% time overlap between the predicted occurrence and a reference occurrence of the same keyword [36]. The results are shown in Table 8.…”
Section: Speaker Diarization (mentioning)
confidence: 99%
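The excerpt above describes the KWS matching rule: a hypothesized keyword occurrence counts as correct when it overlaps a reference occurrence of the same keyword by at least 50% in time. The following is a minimal Python sketch of that rule, not the scoring code of the cited work; measuring the overlap against the reference occurrence's duration and the greedy one-to-one matching are assumptions made here for illustration.

```python
# Minimal sketch of the 50%-time-overlap matching rule described above.
# Assumption (not specified in the excerpt): the overlap ratio is measured
# relative to the reference occurrence's duration.

def overlaps_enough(pred_start, pred_end, ref_start, ref_end, min_ratio=0.5):
    """Return True if the predicted interval covers at least `min_ratio`
    of the reference interval in time (times in seconds)."""
    intersection = max(0.0, min(pred_end, ref_end) - max(pred_start, ref_start))
    ref_duration = ref_end - ref_start
    return ref_duration > 0 and intersection / ref_duration >= min_ratio


def count_correct_hits(predictions, references):
    """Count predicted keyword occurrences that match a reference occurrence
    of the same keyword with at least 50% time overlap.

    Both arguments are lists of (keyword, start_sec, end_sec) tuples.
    Each reference occurrence is matched at most once (greedy matching,
    an assumption for this sketch).
    """
    unmatched_refs = list(references)
    correct = 0
    for kw, p_start, p_end in predictions:
        for i, (r_kw, r_start, r_end) in enumerate(unmatched_refs):
            if kw == r_kw and overlaps_enough(p_start, p_end, r_start, r_end):
                correct += 1
                del unmatched_refs[i]
                break
    return correct


# Example: one hit that overlaps the reference by ~67% and one false alarm.
preds = [("budget", 1.0, 1.6), ("budget", 5.0, 5.4)]
refs = [("budget", 1.1, 1.7)]
print(count_correct_hits(preds, refs))  # -> 1
```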
“…Each dot represents a language represented in the corpus. In [18], we improved the alignments and phonemic transcripts of 48 languages in the companion corpus by using the zero-resource acoustic modelling approaches discussed in this dissertation [19][20][21][22]. This allowed for the first time the systematic investigation of phonetic typology across a wide range of languages.…”
Section: Introduction (mentioning)
confidence: 99%
“…Automatic speech recognition (ASR) has a long history of research (Bahl et al., 1983; Hinton et al., 2012; Chu et al., 2020). Through audio signal processing and modeling, speech content can be transcribed into text for various applications (Yu and Deng, 2016; Yang et al., 2021). Yet in particular cases, the audio signals cannot be clearly produced or captured.…”
Section: Introduction (mentioning)
confidence: 99%