CTC-Segmentation of Large Corpora for German End-to-End Speech Recognition

Kürzinger, Ludwig; Winkelbauer, Dominik; Li, Lujun; Watzel, Tobias; Rigoll, Gerhard

doi:10.1007/978-3-030-60276-5_27

Cited by 50 publications

(32 citation statements)

References 6 publications


“…In addition, the CTC-based algorithm has an alignment function and can output alignment information. Recently, Ludwig Kürzinge et al proposed to use a CTC-based network for the segmentation task and it outperforms the other existing segmentation tools [41]. It should be noted that here "segmentation task" in [41] is similar to the SAD task.…”

Section: Selection of the Force Alignment Module for ASR and KWS (mentioning)

confidence: 99%

“…Therefore, it is possible to compute all possible maximum joint probabilities for aligning the text via dynamic programming. In [41], the CTC network is used for utterance-level segments and outperforms other existing segmentation tools. However, because CTC is also a kind of sequence modeling, the CTC loss is actually the sum of the probabilities of multiple alignment paths.…”

Section: Selection of the Force Alignment Module for ASR and KWS (mentioning)

confidence: 99%


Timestamp-aligning and keyword-biasing end-to-end ASR front-end for a KWS system

Shi, Zhang, Wang, et al. 2021. J AUDIO SPEECH MUSIC PROC.

Many end-to-end approaches have been proposed to detect predefined keywords. For scenarios of multi-keywords, there are still two bottlenecks that need to be resolved: (1) the distribution of important data that contains keyword(s) is sparse, and (2) the timestamps of the detected keywords are inaccurate. In this paper, to alleviate the first issue and further improve the performance of the end-to-end ASR front-end, we propose the biased loss function for guiding the recognizer to pay more attention to the speech segments containing the predefined keywords. As for the second issue, we solve this problem by modifying the force alignment applied to the end-to-end ASR front-end. To get the frame-level alignment, we utilize a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) based acoustic model (AM) for auxiliary. The proposed system is evaluated in the OpenSAT20 held by the National Institute of Standards and Technology (NIST). The performance of our end-to-end KWS system is comparable to the conventional hybrid KWS system, sometimes even slightly better. With fusion results of the end-to-end and conventional KWS systems, we won the first prize in the KWS track. On the dev dataset (a part of SAFE-T corpus), the system outperforms the baseline by a large margin, i.e., our system with GMM-HMM aligner has a lower segmentation-aware word error rates (relatively 7.9–19.2% decrease) and higher overall Actual term-weighted values (relatively 3.6–11.0% increase), which demonstrates the effectiveness of the proposed method. For more precise alignments, we can use DNN-based AM as alignmentor at the cost of more computation.


“…Typical frame-synchronous alignment methods require frame-wise prediction by pre-trained ASR models. Recently, a DNN-based method referred to as CTC-Segmentation [21] has been proposed. CTC-Segmentation generates frame-wise token posteriors using CTC, one of the end-to-end neural network models, and then the alignment is estimated by finding an optimal path from the CTC trellis based on the generated posteriors.…”

Section: Alignment Approaches, 2.1 Frame-synchronous Alignment (mentioning)

confidence: 99%

“…A traditional approach aligns the text by finding an optimal path from the HMM trellis using Viterbi algorithm [19,20]. A similar work based on connectionist temporal classification (CTC) model has also been proposed recently [21]. In, [22,23], the long audio recordings are firstly recognized by a pre-trained ASR model, then the alignment is performed based on text matching between the recognized text and manual transcripts.…”

Section: Introduction (mentioning)

confidence: 99%

Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and Backward Transformers

Kida, Komatsu, Togami. 2021. Preprint.

This paper proposes a novel label-synchronous speech-to-text alignment technique for automatic speech recognition (ASR). The speech-to-text alignment is a problem of splitting long audio recordings with un-aligned transcripts into utterance-wise pairs of speech and text. Unlike conventional methods based on frame-synchronous prediction, the proposed method re-defines the speech-to-text alignment as a label-synchronous text mapping problem. This enables an accurate alignment benefiting from the strong inference ability of the state-of-the-art attention-based encoder-decoder models, which cannot be applied to the conventional methods. Two different Transformer models named forward Transformer and backward Transformer are respectively used for estimating an initial and final tokens of a given speech segment based on end-of-sentence prediction with teacher-forcing. Experiments using the corpus of spontaneous Japanese (CSJ) demonstrate that the proposed method provides an accurate utterance-wise alignment, that matches the manually annotated alignment with as few as 0.2% errors. It is also confirmed that a Transformer-based hybrid CTC/Attention ASR model using the aligned speech and text pairs as an additional training data reduces character error rates relatively up to 59.0%, which is significantly better than 39.0% reduction by a conventional alignment method based on connectionist temporal classification model.


“…One reason is that model ASR has increasingly shifted towards end-to-end training using loss functions like CTC [9] that disregards precise frame alignment. Only a few works explored using neural networks to perform segmentation of sentences [10] and phones [11,12,13]. These works demonstrate great potentials for neural forced alignment, but they still required text transcriptions.…”

Section: Introduction (mentioning)

confidence: 99%

Phone-to-audio alignment without text: A Semi-supervised Approach

Zhu, Zhang, Jurgens. 2021. Preprint.

The task of phone-to-audio alignment has many applications in speech research. Here we introduce two Wav2Vec2-based models for both text-dependent and text-independent phone-to-audio alignment. The proposed Wav2Vec2-FS, a semi-supervised model, directly learns phone-to-audio alignment through contrastive learning and a forward sum loss, and can be coupled with a pretrained phone recognizer to achieve text-independent alignment. The other model, Wav2Vec2-FC, is a frame classification model trained on forced aligned labels that can both perform forced alignment and text-independent segmentation. Evaluation results suggest that both proposed methods, even when transcriptions are not available, generate highly close results to existing forced alignment tools. Our work presents a neural pipeline of fully automated phone-to-audio alignment. Code and pretrained models are available at https://github.com/lingjzhu/charsiu.
