Approximate search of audio queries by using DTW with phone time boundary and data augmentation

Xu, Haikua; Hou, Jingyong; Xiao, Xiong; Pham, Van Tung; Leung, Cheung-Chi; Wang, Lei; Hai, Van; Lv, Hang; Xie, Lei; Ma, Bin; Chng, Eng Siong; Li, Haizhou

doi:10.1109/icassp.2016.7472835

Cited by 13 publications

(12 citation statements)

References 10 publications

“…[35][36][37] propose a logistic regression-based fusion of acoustic keyword spotting and DTW-based systems using language-dependent phoneme recognizers. [38][39][40][41] use a logistic regression-based fusion on DTW- and phone-based systems. Oishi et al. [42] uses a DTW-based search at the HMM state-level from syllables obtained from a word-based speech recognizer and a deep neural network (DNN) posteriorgram-based rescoring, and [43] adds a logistic regression-based approach for detection rescoring.…”

Section: Hybrid Approach (mentioning)

confidence: 99%
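
The logistic regression-based fusion mentioned in the statement above can be illustrated with a minimal sketch. It is a generic score-level fusion, not the recipe of any cited system: the function names `fit_logistic_fusion` and `fuse`, the learning rate, the epoch count, and the toy scores are all assumptions made for illustration. Each row holds the detection scores that two systems (for example a DTW-based and a phone-based one) assign to the same candidate detection.

```python
import numpy as np


def fit_logistic_fusion(scores, labels, lr=0.5, epochs=2000):
    """Fit weights w and bias b so that sigmoid(scores @ w + b) approximates P(hit).

    scores: (n_detections, n_systems) array of per-system detection scores.
    labels: (n_detections,) array with 1 for a true hit and 0 for a false alarm.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    w = np.zeros(scores.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(scores @ w + b)))
        grad = p - labels  # gradient of the log-loss with respect to the logit
        w -= lr * scores.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b


def fuse(scores, w, b):
    """Return one fused confidence per detection."""
    scores = np.asarray(scores, dtype=float)
    return 1.0 / (1.0 + np.exp(-(scores @ w + b)))


# Hypothetical development-set scores: column 0 = system A, column 1 = system B.
dev_scores = [[0.9, 0.8], [0.8, 0.7], [0.7, 0.9],
              [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]
dev_labels = [1, 1, 1, 0, 0, 0]
w, b = fit_logistic_fusion(dev_scores, dev_labels)
fused = fuse(dev_scores, w, b)
```

In practice the weights would be trained on held-out development detections and then applied to the evaluation detections; a library implementation of logistic regression could replace the hand-written gradient descent.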

ALBAYZIN Query-by-example Spoken Term Detection 2016 evaluation

Tejedor

Toledano

López-Otero

et al. 2018

J AUDIO SPEECH MUSIC PROC.

Query-by-example Spoken Term Detection (QbE STD) aims to retrieve data from a speech repository given an acoustic (spoken) query containing the term of interest as the input. This paper presents the systems submitted to the ALBAYZIN QbE STD 2016 Evaluation held as a part of the ALBAYZIN 2016 Evaluation Campaign at the IberSPEECH 2016 conference. Special attention was given to the evaluation design so that a thorough post-analysis of the main results could be carried out. Two different Spanish speech databases, which cover different acoustic and language domains, were used in the evaluation: the MAVIR database, which consists of a set of talks from workshops, and the EPIC database, which consists of a set of European Parliament sessions in Spanish. We present the evaluation design, both databases, the evaluation metric, the systems submitted to the evaluation, the results, and a thorough analysis and discussion. Four different research groups participated in the evaluation, and a total of eight template matching-based systems were submitted. We compare the systems submitted to the evaluation and make an in-depth analysis based on some properties of the spoken queries, such as query length, single-word/multi-word queries, and in-language/out-of-language queries.

“…Our partial matching DTW systems, including fixed-window [8,16] and phoneme-sequence [17] partial matching systems, were used to deal with T2 and T3 queries. In each fixed-window partial matching system, an analysis window between 70 and 90 frames long was defined.…”

Section: DTW Systems (mentioning)

confidence: 99%
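
The fixed-window partial matching described in the statement above can be sketched as follows. This is a simplified illustration, not the cited systems' implementation: the cosine frame distance, the hop of 10 frames, the length normalization, and the function names are assumptions; only the window length (80 frames, inside the quoted 70 to 90 frame range) comes from the statement. Each window cut from the query is matched anywhere in the utterance with subsequence DTW, and the best window cost is kept.

```python
import numpy as np


def subsequence_dtw_cost(query, utterance, eps=1e-12):
    """Length-normalized cost of the best match of `query` anywhere in `utterance`.

    Both inputs are (frames, dims) feature matrices.
    """
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + eps)
    u = utterance / (np.linalg.norm(utterance, axis=1, keepdims=True) + eps)
    dist = 1.0 - q @ u.T  # cosine distance between every frame pair
    n, m = dist.shape
    acc = np.full((n, m), np.inf)
    acc[0, :] = dist[0, :]  # free start: a match may begin at any utterance frame
    for i in range(1, n):
        acc[i, 0] = acc[i - 1, 0] + dist[i, 0]
        for j in range(1, m):
            acc[i, j] = dist[i, j] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[-1, :].min() / n  # free end, normalized by the query length


def fixed_window_partial_cost(query, utterance, window=80, hop=10):
    """Best subsequence-DTW cost over fixed-length windows cut from the query."""
    if len(query) <= window:
        return subsequence_dtw_cost(query, utterance)
    starts = range(0, len(query) - window + 1, hop)
    return min(subsequence_dtw_cost(query[s:s + window], utterance) for s in starts)
```

A query whose first 80 frames occur in the utterance but whose remaining frames do not would score poorly under whole-query matching, while the windowed variant still finds the matching portion.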

“…Unsupervised acoustic modeling or feature extraction has been studied in [11][12][13][14] to deal with the lack of knowledge about target data. Partial matching techniques [15][16][17] have been developed to deal with different kinds of query matches for the QUESST 2014.…”

Section: Introduction (mentioning)

confidence: 99%

Toward High-Performance Language-Independent Query-by-Example Spoken Term Detection for MediaEval 2015: Post-Evaluation Analysis

Leung

Wang

et al. 2016

Interspeech 2016

Self Cite

This paper documents the significant components of a state-of-the-art language-independent query-by-example spoken term detection system designed for the Query by Example Search on Speech Task (QUESST) in MediaEval 2015. We developed exact and partial matching DTW systems, and WFST based symbolic search systems to handle different types of search queries. To handle the noisy and reverberant speech in the task, we trained tokenizers using data augmented with different noise and reverberation conditions. Our post-evaluation analysis showed that the phone boundary label provided by the improved tokenizers brings more accurate speech activity detection in DTW systems. We argue that acoustic condition mismatch is possibly a more important factor than language mismatch for obtaining consistent gain from stacked bottleneck features. Our post-evaluation system, involving a smaller number of component systems, can outperform our submitted systems, which performed the best for the task.

“…the DTW distance between two speech signals; distances from other variants of DTW such as subsequence DTW [156] and partial DTW [139]…”

Section: Neural Network Classifier (mentioning)

confidence: 99%

“…Specifically, overlapping sub-sequences, also called partial The proposed partial search approach is motivated by the success of the partial matching approach on the query-by-example [139,140] and repeating sequence detection [141] tasks. However, in [139,140], each partial sequence is a sequence of acoustic vectors rather than a sequence of subword units. In the context of KWS, the idea of deriving sub-sequences from the subword sequence of a keyword is similar to the ngram-based approach [93,142,143].…”

Section: Introduction (mentioning)

confidence: 99%
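
The idea in the statement above, deriving overlapping sub-sequences from the subword sequence of a keyword, can be illustrated with a short sketch. The function name, the sub-sequence length, the hop, and the example phone sequence are hypothetical; the cited work's actual segmentation rules are not given here.

```python
def overlapping_subsequences(units, length, hop=1):
    """Return overlapping sub-sequences of `length` units, moving `hop` units each step.

    A sequence no longer than `length` is returned whole as a single sub-sequence.
    """
    if length <= 0 or hop <= 0:
        raise ValueError("length and hop must be positive")
    units = list(units)
    if len(units) <= length:
        return [tuple(units)]
    return [tuple(units[i:i + length]) for i in range(0, len(units) - length + 1, hop)]


# Hypothetical phone sequence for a keyword, split into overlapping trigrams.
phones = ["k", "ae", "t", "s"]
parts = overlapping_subsequences(phones, length=3)
```

Each sub-sequence can then be searched for separately, so a keyword that is only partly recognized can still produce a candidate detection.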

Robust spoken term detection using partial search and re-scoring hypothesized detections techniques

Pham

Self Cite
