Toward High-Performance Language-Independent Query-by-Example Spoken Term Detection for MediaEval 2015: Post-Evaluation Analysis

Leung, Cheung-Chi; Wang, Lei; Xu, Haihua; Hou, Jingyong; Pham, Van Tung; Lv, Hang; Xie, Lei; Xiao, Xiong; Ni, Chuanfa; Ma, Bin; Chng, Eng Siong; Li, Haizhou

doi:10.21437/interspeech.2016-691

Cited by 20 publications

(18 citation statements)

References 26 publications


“…An information retrieval technique to hypothesize detection and DTW-based score detection are proposed in [39]. Logistic regression-based fusion on DTW and phone-based systems is employed in [71][72][73][74]. DTW-based search at the HMM state-level from syllables obtained from a word-based speech recognizer and a deep neural network (DNN) posteriorgram-based rescoring are employed in [75], and [76] adds a logistic regression-based approach for detection rescoring.…”

Section: Hybrid Methods (mentioning)

confidence: 99%

Search on speech from spoken queries: the Multi-domain International ALBAYZIN 2018 Query-by-Example Spoken Term Detection Evaluation

Tejedor, Toledano, López-Otero, et al. 2019

J AUDIO SPEECH MUSIC PROC.


The huge amount of information stored in audio and video repositories makes search on speech (SoS) a priority area nowadays. Within SoS, Query-by-Example Spoken Term Detection (QbE STD) aims to retrieve data from a speech repository given a spoken query. Research on this area is continuously fostered with the organization of QbE STD evaluations. This paper presents a multi-domain internationally open evaluation for QbE STD in Spanish. The evaluation aims at retrieving the speech files that contain the queries, providing their start and end times, and a score that reflects the confidence given to the detection. Three different Spanish speech databases that encompass different domains have been employed in the evaluation: MAVIR database, which comprises a set of talks from workshops; RTVE database, which includes broadcast television (TV) shows; and COREMAH database, which contains 2-people spontaneous speech conversations about different topics. The evaluation has been designed carefully so that several analyses of the main results can be carried out. We present the evaluation itself, the three databases, the evaluation metrics, the systems submitted to the evaluation, the results, and the detailed post-evaluation analyses based on some query properties (within-vocabulary/out-of-vocabulary queries, single-word/multi-word queries, and native/foreign queries). Fusion results of the primary systems submitted to the evaluation are also presented. Three different teams took part in the evaluation, and ten different systems were submitted. The results suggest that the QbE STD task is still in progress, and the performance of these systems is highly sensitive to changes in the data domain. Nevertheless, QbE STD strategies are able to outperform text-based STD in unseen data domains.


“…[35][36][37] propose a logistic regression-based fusion of acoustic keyword spotting and DTW-based systems using language-dependent phoneme recognizers. [38][39][40][41] use a logistic regression-based fusion on DTW- and phone-based systems. Oishi et al [42] uses a DTW-based search at the HMM state-level from syllables obtained from a word-based speech recognizer and a deep neural network (DNN) posteriorgram-based rescoring, and [43] adds a logistic regression-based approach for detection rescoring.…”

Section: Hybrid Approach (mentioning)

confidence: 99%

ALBAYZIN Query-by-example Spoken Term Detection 2016 evaluation

Tejedor, Toledano, López-Otero, et al. 2018

J AUDIO SPEECH MUSIC PROC.


Query-by-example Spoken Term Detection (QbE STD) aims to retrieve data from a speech repository given an acoustic (spoken) query containing the term of interest as the input. This paper presents the systems submitted to the ALBAYZIN QbE STD 2016 Evaluation held as a part of the ALBAYZIN 2016 Evaluation Campaign at the IberSPEECH 2016 conference. Special attention was given to the evaluation design so that a thorough post-analysis of the main results could be carried out. Two different Spanish speech databases, which cover different acoustic and language domains, were used in the evaluation: the MAVIR database, which consists of a set of talks from workshops, and the EPIC database, which consists of a set of European Parliament sessions in Spanish. We present the evaluation design, both databases, the evaluation metric, the systems submitted to the evaluation, the results, and a thorough analysis and discussion. Four different research groups participated in the evaluation, and a total of eight template matching-based systems were submitted. We compare the systems submitted to the evaluation and make an in-depth analysis based on some properties of the spoken queries, such as query length, single-word/multi-word queries, and in-language/out-of-language queries.


“…Retrieving spoken content with spoken queries, also known as query-by-example spoken term detection (STD) [1][2][3][4][5][6], is attractive because hand-held or wearable devices make spoken queries a natural choice. The most intuitive way to search over spoken content for a spoken query is to directly match the audio signals to find those audio snippets that sound like the spoken query, and dynamic time warping (DTW) [7] is widely used.…”

Section: Introduction (mentioning)

confidence: 99%

Query-by-Example Spoken Term Detection Using Attention-Based Multi-Hop Networks

Lee 2018

2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Retrieving spoken content with spoken queries, or query-by-example spoken term detection (STD), is attractive because it makes possible the matching of signals directly on the acoustic level without transcribing them into text. Here, we propose an end-to-end query-by-example STD model based on an attention-based multi-hop network, whose input is a spoken query and an audio segment containing several utterances; the output states whether the audio segment includes the query. The model can be trained in either a supervised scenario using labeled data, or in an unsupervised fashion. In the supervised scenario, we find that the attention mechanism and multiple hops improve performance, and that the attention weights indicate the time span of the detected terms. In the unsupervised setting, the model mimics the behavior of DTW, and it performs as well as DTW but with a lower run-time complexity.


