ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8682954

Self-attention Aligner: A Latency-control End-to-end Model for ASR Using Self-attention Network and Chunk-hopping

Abstract: Self-attention network, an attention-based feedforward neural network, has recently shown the potential to replace recurrent neural networks (RNNs) in a variety of NLP tasks. However, it is not clear whether the self-attention network could be a good alternative to RNNs in automatic speech recognition (ASR), which processes longer speech sequences and may have online recognition requirements. In this paper, we present an RNN-free end-to-end model: self-attention aligner (SAA), which applies the self-attention ne…

Cited by 80 publications (65 citation statements)
References 21 publications
“…For AISHELL-2, we use all the train data (1000 hours) for training, mix the three development sets for validation and use the three test sets for evaluation. For HKUST, we use the same training (∼168 hours), validation and evaluation set as [15]. The training of LM on AISHELL-2 and HKUST uses the text from respective training set.…”
Section: Methods (mentioning)
confidence: 99%
“…Recently, there have been several works that have applied self-attention mechanism in speech recognition and achieved comparable results with traditional hybrid models [6,17,21]. Different from these, we introduce self-attention mechanism into transducer-based model.…”
Section: Related Work (mentioning)
confidence: 99%
“…We also propose a chunk-flow mechanism to realize online decoding. Different from chunk-hopping mechanism in [21], which segments an entire utterance into several overlapped chunks as the inputs, we utilize a sliding window at each layer to limit the scope of the self-attention. Chunk-flow mechanism is more analogous to the time-restricted self-attention layer [16].…”
Section: Related Work (mentioning)
confidence: 99%
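
The chunk-hopping mechanism referred to in the statement above (segmenting an entire utterance into several overlapping chunks that are fed to the self-attention encoder one at a time) can be sketched in a few lines. This is a minimal illustration only: the function name, chunk size, and left/right context widths are assumptions made for the example, not the hyperparameters reported in the paper.

```python
import numpy as np

def chunk_hopping(features, chunk_size=64, left_ctx=16, right_ctx=16):
    """Split an utterance's feature matrix (T, D) into overlapping chunks.

    Each chunk keeps `chunk_size` central frames plus left/right context
    frames shared with neighbouring chunks, so self-attention only ever
    sees a bounded window of the utterance (latency-controlled decoding).
    """
    T = features.shape[0]
    chunks = []
    for start in range(0, T, chunk_size):           # hop by the central chunk size
        s = max(0, start - left_ctx)                 # prepend overlapping left context
        e = min(T, start + chunk_size + right_ctx)   # append overlapping right context
        chunks.append(features[s:e])
    return chunks

# Toy usage: 300 frames of 80-dimensional filterbank features.
utt = np.random.randn(300, 80).astype("float32")
print([chunk.shape[0] for chunk in chunk_hopping(utt)])  # chunk lengths incl. context
```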
“…However, Gaussian masking still requires the entire input sequence. Dong et al [25] introduced a chunk hopping mechanism to the CTC-Transformer model to support online recognition, which degraded the standard Transformer since it ignored the global context.…”
Section: Relation With Prior Work (mentioning)
confidence: 99%
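
As a rough illustration of the contrast drawn in the statement above, a time-restricted (sliding-window) self-attention mask bounds which frames each position may attend to, whereas Gaussian masking only reweights attention scores and therefore still needs the entire input sequence to be available. The sketch below is an assumption-laden toy, not code from any of the cited papers; the window widths and function name are made up for the example.

```python
import numpy as np

def banded_attention_mask(T, left=16, right=4):
    """Boolean (T, T) mask: position t may attend only to frames in
    [t - left, t + right]. True means the score is kept, False means
    it is blocked before the softmax.
    """
    idx = np.arange(T)
    offset = idx[None, :] - idx[:, None]   # signed frame offset j - i
    return (offset >= -left) & (offset <= right)

# Small example: each row shows which frames that time step can see.
print(banded_attention_mask(10, left=2, right=1).astype(int))
```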