End-to-End Keyword Search Based on Attention and Energy Scorer for Low Resource Languages

Zhao, Zeyu; Zhang, Weiqiang

doi:10.21437/interspeech.2020-2613

Cited by 9 publications

(8 citation statements)

References 14 publications

“…The integration of an attention mechanism (including a variant called multi-head attention [144]) in (primarily) Seq2Seq acoustic models in order to focus on the keyword(s) of interest has successfully been accomplished by a number of works, e.g., [26], [32], [60], [68], [133], [143], [145]. These works find that incorporating attention provides KWS performance gains with respect to counterpart Seq2Seq models without attention.…”

Section: The Attention Mechanism (mentioning)

confidence: 99%

Deep Spoken Keyword Spotting: An Overview

López-Espejo, Hansen, Jensen

2022

IEEE Access

Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams and has become a fast-growing technology thanks to the paradigm shift introduced by deep learning a few years ago. This has allowed the rapid embedding of deep KWS in a myriad of small electronic devices with different purposes like the activation of voice assistants. Prospects suggest a sustained growth in terms of social use of this technology. Thus, it is not surprising that deep KWS has become a hot research topic among speech scientists, who constantly look for KWS performance improvement and computational complexity reduction. This context motivates this paper, in which we conduct a literature review into deep spoken KWS to assist practitioners and researchers who are interested in this technology. Specifically, this overview has a comprehensive nature by covering a thorough analysis of deep KWS systems (which includes speech features, acoustic modeling and posterior handling), robustness methods, applications, datasets, evaluation metrics, performance of deep KWS systems and audio-visual KWS. The analysis performed in this paper allows us to identify a number of directions for future research, including directions adopted from automatic speech recognition research and directions that are unique to the problem of spoken KWS.

Index Terms: Keyword spotting, deep learning, acoustic model, small footprint, robustness.

“…Deep learning and, in particular, end-to-end systems were also recently investigated to solve the STD problem directly. In this direction, several end-to-end ASR-free approaches for STD were proposed [13, 34–36]. In addition to exploring neural end-to-end approaches, deep learning is extensively used to extract representations (embeddings) of audio documents and query terms that facilitate the search [20, 21, 23, 25].…”

Section: Spoken Term Detection (mentioning)

confidence: 99%

“…This program focused on building fully automatic and noise-robust speech recognition and search systems in a very limited amount of time (e.g., one week) and with limited amount of training data. The languages addressed in that program were low-resourced, such as Cantonese, Pashto, Tagalog, Turkish, Vietnamese, Swahili, Tamil and so on, and significant research has been carried out [13, 61, 147–159].…”

Section: Comparison With Previous STD International Evaluations (mentioning)

confidence: 99%

“…The huge amount of information stored in audio and audiovisual repositories makes it necessary to develop efficient methods for search on speech (SoS). Significant research has been carried out for years in this area, and, in particular, in the tasks of spoken document retrieval (SDR) [1–6], keyword spotting (KWS) [7–13], spoken term detection (STD) [14–25] and query-by-example spoken term detection (QbE STD) [26–31].…”

Section: Introduction (mentioning)

confidence: 99%

The Multi-Domain International Search on Speech 2020 ALBAYZIN Evaluation: Overview, Systems, Results, Discussion and Post-Evaluation Analyses

et al. 2021

The large amount of information stored in audio and video repositories makes search on speech (SoS) a challenging area that is continuously receiving much interest. Within SoS, spoken term detection (STD) aims to retrieve speech data given a text-based representation of a search query (which can include one or more words). On the other hand, query-by-example spoken term detection (QbE STD) aims to retrieve speech data given an acoustic representation of a search query. This is the first paper that presents an internationally open multi-domain evaluation for SoS in Spanish that includes both STD and QbE STD tasks. The evaluation was carefully designed so that several post-evaluation analyses of the main results could be carried out. The evaluation tasks aim to retrieve the speech files that contain the queries, providing their start and end times and a score that reflects how likely the detection within the given time intervals and speech file is. Three different speech databases in Spanish that comprise different domains were employed in the evaluation: the MAVIR database, which comprises a set of talks from workshops; the RTVE database, which includes broadcast news programs; and the SPARL20 database, which contains Spanish parliament sessions. We present the evaluation itself, the three databases, the evaluation metric, the systems submitted to the evaluation, the evaluation results and some detailed post-evaluation analyses based on specific query properties (in-vocabulary/out-of-vocabulary queries, single-word/multi-word queries and native/foreign queries). The most novel features of the submitted systems are a data augmentation technique for the STD task and an end-to-end system for the QbE STD task. The obtained results suggest that there is clearly room for improvement in the SoS task and that performance is highly sensitive to changes in the data domain.

“…However, the approaches mentioned above have two disadvantages: (1) they are ASR-free and designed for a small number of the keywords of interest, and (2) they neglect the timestamps of keywords. Some people work on ASR-free multi-keyword detection [23–25], but the timestamps of keywords are still neglected. Nonetheless, in some practical applications, the timestamps of a large amount of keywords are still required.…”

Section: Introduction (mentioning)

confidence: 99%

Timestamp-aligning and keyword-biasing end-to-end ASR front-end for a KWS system

Shi, Zhang, Wang et al. 2021

J AUDIO SPEECH MUSIC PROC.

Self Cite

Many end-to-end approaches have been proposed to detect predefined keywords. For scenarios of multi-keywords, there are still two bottlenecks that need to be resolved: (1) the distribution of important data that contains keyword(s) is sparse, and (2) the timestamps of the detected keywords are inaccurate. In this paper, to alleviate the first issue and further improve the performance of the end-to-end ASR front-end, we propose the biased loss function for guiding the recognizer to pay more attention to the speech segments containing the predefined keywords. As for the second issue, we solve this problem by modifying the force alignment applied to the end-to-end ASR front-end. To get the frame-level alignment, we utilize a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) based acoustic model (AM) for auxiliary. The proposed system is evaluated in the OpenSAT20 held by the National Institute of Standards and Technology (NIST). The performance of our end-to-end KWS system is comparable to the conventional hybrid KWS system, sometimes even slightly better. With fusion results of the end-to-end and conventional KWS systems, we won the first prize in the KWS track. On the dev dataset (a part of SAFE-T corpus), the system outperforms the baseline by a large margin, i.e., our system with GMM-HMM aligner has a lower segmentation-aware word error rates (relatively 7.9–19.2% decrease) and higher overall Actual term-weighted values (relatively 3.6–11.0% increase), which demonstrates the effectiveness of the proposed method. For more precise alignments, we can use DNN-based AM as alignmentor at the cost of more computation.
