2018
DOI: 10.48550/arxiv.1803.09519
Preprint

Self-Attentional Acoustic Models

Cited by 21 publications (38 citation statements)
References 0 publications
“…There have been abundant studies on Transformers for end-toend speech recognition, particularly in the context of the S2S model with attention [5,6,7,16], as well as Transformer Transducers [8,17]. In [5,10], the authors compared RNNs with transformers for various speech recognition tasks, and obtained competitive or even better results with Transformers.…”
Section: Related Work
confidence: 99%
“…For speech recognition, Transformers have achieved competitive recognition accuracy compared to RNN-based counterparts within both end-to-end [5,6,7,8,9,10] and hybrid [11,12] frameworks. However, the superior results are usually achieved in the offline condition, while in the streaming fashion, Transformers have shown significant degradation in terms of accuracy from previous results [5,12], even in a condition of a large latency constraint.…”
Section: Introduction
confidence: 99%
“…Using the notation in [51], we reshape audio in the following way, where X is a sequence of amplitudes X = {x_0, x_1, ..., x_n}, l is the sequence length, and d is the hidden dimension:…”
Section: B. Feature Extraction
confidence: 99%
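The reshaping described in that excerpt can be sketched as follows. This is a hypothetical illustration only, not the cited paper's actual code: the concrete values of n and d are made up, and the truncation of any remainder is an assumption made so the amplitude vector divides evenly into l frames of hidden dimension d.

```python
import numpy as np

# Hypothetical sketch: reshape a raw amplitude sequence X = {x_0, ..., x_n}
# into an (l, d) matrix, where l is the sequence length fed to the model
# and d is the hidden dimension. The values below are illustrative.
n, d = 48000, 80            # e.g. 3 s of 16 kHz audio, frames of width 80
x = np.random.randn(n)      # stand-in for the amplitude sequence

l = n // d                  # assumption: drop any remainder so n = l * d
X = x[: l * d].reshape(l, d)

print(X.shape)              # (600, 80)
```

With this layout, each row of X is one d-dimensional "frame" that a self-attention layer can treat as a token.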
“…There have been a few studies on transformers for end-to-end speech recognition, particularly for sequence-to-sequence with attention model [10,11,12], as well as transducer [13] and CTC models [14]. In [10], the authors compared RNNs with transformers for various speech recognition and synthesis tasks, and obtained competitive or even better results with transformers.…”
Section: Related Work
confidence: 99%
“…CNNs, on the other hand, require multiple layers to capture the correlations between the two features which are very distant in the time space, although dilation that uses large strides can reduce the number of layers that is required. While there have been many studies on end-to-end speech recognition using transformers [10,11,12,13,14], their applications for hybrid acoustic models are less well understood. In this paper, we study the more standard transformer for speech recognition within the hybrid framework, and provide further insight to this model through experiments on the Librispeech public dataset.…”
Section: Introduction
confidence: 99%
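The point in that excerpt about dilation reducing the number of CNN layers can be made concrete with the standard receptive-field formula for stacked 1-D convolutions, R = 1 + Σ_i (k − 1) · d_i, where k is the kernel size and d_i the dilation of layer i. The numbers below are an illustrative sketch, not taken from any of the cited papers:

```python
def receptive_field(kernel_size, dilations):
    # Receptive field of stacked 1-D convolutions (stride 1):
    # R = 1 + sum over layers of (kernel_size - 1) * dilation
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Four plain layers vs four dilated layers with the same kernel size:
print(receptive_field(3, [1, 1, 1, 1]))  # 9
print(receptive_field(3, [1, 2, 4, 8]))  # 31
```

Doubling the dilation per layer grows the receptive field exponentially with depth, which is why dilated CNNs need far fewer layers to relate two distant frames, while self-attention relates any pair of frames in a single layer.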