Audio segmentation-by-classification approach based on factor analysis in broadcast news domain

Castán, Diego; Giménez, Alfonso Ortega; Miguel, Antonio; Lleida, Eduardo

doi:10.1186/s13636-014-0034-5

Cited by 20 publications

(13 citation statements)

References 39 publications

(49 reference statements)


“…The winner team of the original Albayzín 2010 evaluation proposed a segmentation by classification approach based on a hierarchical GMM/HMM (dark blue) including MFCCs, chroma and spectral entropy as input feature [65]. The best result so far in this database was obtained with a solution based on factor analysis combined with a Gaussian backend (orange) and MFCCs with 1st and 2nd order derivatives as input features [17]. Our three previously explained final results combining the RNN classifier and the HMM resegmentation are also presented: the RNN baseline (purple), the BLSTM 1 PoolBLSTM 2 RNN approach (green) and the BLSTM 1 PoolBLSTM 2 RNN trained using mixup augmentation (light blue).…”

Section: Discussion (mentioning)

confidence: 99%


Multiclass audio segmentation based on recurrent neural networks for broadcast domain data

Gimeno

Viñals

Giménez

et al. 2020

J AUDIO SPEECH MUSIC PROC.

Self Cite


This paper presents a new approach based on recurrent neural networks (RNN) to the multiclass audio segmentation task whose goal is to classify an audio signal as speech, music, noise or a combination of these. The proposed system is based on the use of bidirectional long short-term Memory (BLSTM) networks to model temporal dependencies in the signal. The RNN is complemented by a resegmentation module, gaining long term stability by means of the tied state concept in hidden Markov models. We explore different neural architectures introducing temporal pooling layers to reduce the neural network output sampling rate. Our findings show that removing redundant temporal information is beneficial for the segmentation system showing a relative improvement close to 5%. Furthermore, this solution does not increase the number of parameters of the model and reduces the number of operations per second, allowing our system to achieve a real-time factor below 0.04 if running on CPU and below 0.03 if running on GPU. This new architecture combined with a data-agnostic data augmentation technique called mixup allows our system to achieve competitive results in both the Albayzín 2010 and 2012 evaluation datasets, presenting a relative improvement of 19.72% and 5.35% compared to the best results found in the literature for these databases.
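The abstract above names mixup only in passing. As a rough illustration, a minimal sketch of the general mixup idea follows: two training examples and their one-hot labels are blended with a weight drawn from a Beta distribution. The function name `mixup_pair`, the `alpha` default, and the plain-list data layout are assumptions made for this illustration, not details taken from the cited paper.

```python
import random


def mixup_pair(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two (features, one-hot label) examples.

    Hypothetical helper: the mixing weight lam is drawn from Beta(alpha, alpha),
    and the same weight is applied to the features and to the labels.
    """
    lam = rng.betavariate(alpha, alpha)
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix, lam


# Example: mix a "speech" frame with a "music" frame (toy 2-D features).
rng = random.Random(0)  # seeded only to make the example repeatable
x_mix, y_mix, lam = mixup_pair([1.0, 2.0], [1.0, 0.0], [3.0, 4.0], [0.0, 1.0], rng=rng)
```

Because the labels are blended with the same weight as the features, the mixed label still sums to one and can be used directly as a soft target.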



“…Multistage decision trees are used in [16] with the same objective of discriminating speech and music. The factor analysis (FA) technique, usually applied in speaker verification, is adapted to audio segmentation domain by Castán et al in [17] obtaining relevant results for broadcast domain data.…”

Section: Audio Segmentation Approaches (mentioning)

confidence: 99%

Multiclass audio segmentation based on recurrent neural networks for broadcast domain data

Gimeno

Viñals

Giménez

et al. 2020

J AUDIO SPEECH MUSIC PROC.

Self Cite


“…al. in [9]. Such approach based on classifying consecutive audio frames, where the segmentation is performed by an analysis of the sequence of decisions.…”

Section: Related Work (mentioning)

confidence: 99%

Change Point Determination in Audio Data Using Auditory Features

Maka

2015

International Journal of Electronics and Telecommunications


The study investigates the properties of auditory-based features for audio change point detection. Two popular techniques were used in the analysis: a metric-based approach and the ∆BIC scheme. The efficiency of change point detection depends on the type and size of the feature space. Therefore, we have compared two auditory-based feature sets (MFCC and GTEAD) in both change point detection schemes. We have proposed a new technique based on multiscale analysis to determine the content change in the audio data. The comparison of the two typical change point detection techniques with two different feature spaces has been performed on a set of acoustical scenes with a single change point. As the results show, the accuracy of the detected positions depends on the feature type, feature space dimensionality, detection technique and the type of audio data. With the ∆BIC approach, better accuracy was obtained for the MFCC feature space in most cases. However, change point detection with this feature results in a lower detection ratio than with the GTEAD feature. The proposed multiscale metric-based technique was run using the same criteria as for ∆BIC. In that case, the GTEAD feature space led to better accuracy. We have shown that the proposed multiscale change point detection scheme is competitive with the ∆BIC scheme with the MFCC feature space.
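To make the ∆BIC scheme mentioned in this abstract more concrete, here is a minimal sketch for a single change point in a one-dimensional feature sequence. It assumes the common formulation in which one Gaussian over the whole window is compared against two Gaussians split at frame i, with a penalty weighted by `penalty_weight`; the cited paper's exact variant, features, and parameters are not given here, and the names `delta_bic` and `best_change_point` are hypothetical.

```python
import math


def _log_var(xs):
    """Log of the maximum-likelihood variance, floored to avoid log(0)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return math.log(max(var, 1e-12))


def delta_bic(frames, i, penalty_weight=1.0):
    """Delta-BIC for splitting `frames` at index i (1-D Gaussian models).

    Positive values favour the two-model hypothesis, i.e. a change at i.
    """
    n = len(frames)
    left, right = frames[:i], frames[i:]
    gain = 0.5 * (n * _log_var(frames)
                  - len(left) * _log_var(left)
                  - len(right) * _log_var(right))
    # A 1-D Gaussian has 2 free parameters (mean, variance),
    # so the penalty is 0.5 * 2 * log(n), scaled by the weight.
    penalty = penalty_weight * 0.5 * 2 * math.log(n)
    return gain - penalty


def best_change_point(frames, margin=2, penalty_weight=1.0):
    """Return (index, score) of the best split, or (None, score) if none is positive."""
    scores = {i: delta_bic(frames, i, penalty_weight)
              for i in range(margin, len(frames) - margin + 1)}
    i_best = max(scores, key=scores.get)
    if scores[i_best] > 0:
        return i_best, scores[i_best]
    return None, scores[i_best]


# Toy scene: 20 frames around 0 followed by 20 frames around 5.
quiet = [0.0, 0.1, -0.1, 0.05, -0.05] * 4
loud = [5.0, 5.1, 4.9, 5.05, 4.95] * 4
index, score = best_change_point(quiet + loud)
```

The `margin` argument keeps at least two frames on each side of a candidate split so that both variances can be estimated. Real systems typically use multidimensional features such as MFCCs with full covariance matrices, which changes both the likelihood term and the parameter count in the penalty.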


“…This paper describes the database and the evaluation process and summarizes the results obtained. in Spanish [8][9][10][11], and more recently, the Multi-Genre Broadcast (MGB) Challenge with data in English and Arabic 2 [12][13][14]. In other areas apart from broadcast speech, several evaluation campaigns have been proposed such as the ones organized in the scope of the Zero Resource Speech Challenge [15,16], the TC-STAR evaluation on recordings of the European Parliament's sessions in English and Spanish [5], or the MediaEval evaluation of multimodal search and hyperlinking [17]. As a way to measure the performance of different techniques and approaches, in this 2018 edition, the IberSpeech-RTVE Challenge Evaluation campaign was proposed in three different conditions: speech-to-text transcription (STT), speaker diarization (SD), and multimodal diarization (MD).…”

mentioning

confidence: 99%

“…For the evaluation, three television programs were distributed, one from "La Mañana" and two from "La Tarde en 24H Tertulia", which totaled four hours. For enrollment, photos (10) and video (20 s) of the 39 characters to be labeled were provided.…”

mentioning

confidence: 99%

Albayzin 2018 Evaluation: The IberSpeech-RTVE Challenge on Speech Technologies for Spanish Broadcast Media

et al. 2019

Self Cite


The IberSpeech-RTVE Challenge presented at IberSpeech 2018 is a new Albayzin evaluation series supported by the Spanish Thematic Network on Speech Technologies (Red Temática en Tecnologías del Habla (RTTH)). That series was focused on speech-to-text transcription, speaker diarization, and multimodal diarization of television programs. For this purpose, the Corporacion Radio Television Española (RTVE), the main public service broadcaster in Spain, and the RTVE Chair at the University of Zaragoza made more than 500 h of broadcast content and subtitles available for scientists. The dataset included about 20 programs of different kinds and topics produced and broadcast by RTVE between 2015 and 2018. The programs presented different challenges from the point of view of speech technologies such as: the diversity of Spanish accents, overlapping speech, spontaneous speech, acoustic variability, background noise, or specific vocabulary. This paper describes the database and the evaluation process and summarizes the results obtained.

… in Spanish [8][9][10][11], and more recently, the Multi-Genre Broadcast (MGB) Challenge with data in English and Arabic 2 [12][13][14]. In other areas apart from broadcast speech, several evaluation campaigns have been proposed such as the ones organized in the scope of the Zero Resource Speech Challenge [15,16], the TC-STAR evaluation on recordings of the European Parliament's sessions in English and Spanish [5], or the MediaEval evaluation of multimodal search and hyperlinking [17]. As a way to measure the performance of different techniques and approaches, in this 2018 edition, the IberSpeech-RTVE Challenge Evaluation campaign was proposed in three different conditions: speech-to-text transcription (STT), speaker diarization (SD), and multimodal diarization (MD). Twenty-two teams registered to the challenge, and eighteen submitted systems in at least one of the three proposed tasks.

In this paper, we describe the challenge and the data provided by the organization to the participants. We also provide a description of the systems presented to the evaluation, their results, and a set of conclusions that can be drawn from this evaluation campaign.

This paper is organized as follows. In Section 2, the RTVE2018 database is presented. Section 3 describes the three evaluation tasks, speech-to-text transcription, speaker diarization, and multimodal diarization. Section 4 provides a brief description of the main features of the submitted systems. Section 5 presents results, and Section 6 gives conclusions.
