Exploring Transformers for Large-Scale Speech Recognition

Lu, Liang; Liu, Changliang; Li, Jinyu; Gong, Yifan

doi:10.48550/arxiv.2005.09684

Cited by 11 publications

(11 citation statements)

References 16 publications


“…Several studies suggest that down-sampling input representation using convolutional layers before processing with transformer layers provides better results for ASR [24,25]. Intuitively, convolutional layers use local context to produce better contextual features.…”

Section: Resnet+transformer Model (mentioning)

confidence: 99%

Beyond Isolated Utterances: Conversational Emotion Recognition

Pappagari, Żelasko, Villalba et al. 2021

Preprint


Speech emotion recognition is the task of recognizing the speaker's emotional state given a recording of their utterance. While most of the current approaches focus on inferring emotion from isolated utterances, we argue that this is not sufficient to achieve conversational emotion recognition (CER), which deals with recognizing emotions in conversations. In this work, we propose several approaches for CER by treating it as a sequence labeling task. We investigated the transformer architecture for CER and compared it with ResNet-34 and BiLSTM architectures in both contextual and contextless scenarios using the IEMOCAP corpus. Based on the inner workings of the self-attention mechanism, we proposed DiverseCatAugment (DCA), an augmentation scheme, which improved the transformer model performance by an absolute 3.3% micro-f1 on conversations and 3.6% on isolated utterances. We further enhanced the performance by introducing an interlocutor-aware transformer model where we learn a dictionary of interlocutor index embeddings to exploit diarized conversations.


“…Transformers [21] are powerful neural architectures that lately have been used in ASR [22][23][24], SLU [25], and other audio-visual applications [26] with great success, mainly due to their attention mechanism. Only until recently, the attention concept has also been applied to beamforming, specifically for speech and noise mask estimations [9,27].…”

Section: Introduction (mentioning)

confidence: 99%

End-to-End Multi-Channel Transformer for Speech Recognition

Chang, Radfar, Mouchtaris et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Transformers are powerful neural architectures that allow integrating different modalities using attention mechanisms. In this paper, we leverage the neural transformer architectures for multi-channel speech recognition systems, where the spectral and spatial information collected from different microphones are integrated using attention layers. Our multi-channel transformer network mainly consists of three parts: channel-wise self attention layers (CSA), cross-channel attention layers (CCA), and multi-channel encoder-decoder attention layers (EDA). The CSA and CCA layers encode the contextual relationship "within" and "between" channels and across time, respectively. The channel-attended outputs from CSA and CCA are then fed into the EDA layers to help decode the next token given the preceding ones. The experiments show that in a far-field in-house dataset, our method outperforms the baseline single-channel transformer, as well as the super-directive and neural beamformers cascaded with the transformers.


“…End-to-end (E2E) automatic speech recognition (ASR) has made rapid progress in recent years [1,2,3,4,5,6,7]. Representative models include streaming models such as the recurrent neural network transducer (RNN-T) [1], attention-based models [8,2,3], and transformer-based models [9,10,11,12]. Compared to sophisticated conventional models [13,14], E2E models such as RNN-T and Listen, Attend and Spell (LAS) have shown competitive performance [6,5,7,15].…”

Section: Introduction (mentioning)

confidence: 99%

“…While long short-term memory (LSTM) has been a popular building block for E2E models, there has been a continuing success in applying transformer models [22] in ASR [23,11,10,9,24,25,4]. Instead of using a recurrent mechanism to model temporal dynamics, the transformer uses multi-headed attention to associate sequential elements in one step.…”

Section: Introduction (mentioning)

confidence: 99%

“…Instead of using a recurrent mechanism to model temporal dynamics, the transformer uses multi-headed attention to associate sequential elements in one step. [23,11] incorporate transformer layers to conventional models for acoustic modeling. For E2E models, the transformer has been adapted or applied to streaming models [10,9,12] and non-streaming models [4].…”

Section: Introduction (mentioning)

confidence: 99%


Transformer Based Deliberation for Two-Pass Speech Recognition

Pang, Sainath et al. 2021

Preprint


Interactive speech recognition systems must generate words quickly while also producing accurate results. Two-pass models excel at these requirements by employing a first-pass decoder that quickly emits words, and a second-pass decoder that requires more context but is more accurate. Previous work has established that a deliberation network can be an effective second-pass model. The model attends to two kinds of inputs at once: encoded audio frames and the hypothesis text from the first-pass model. In this work, we explore using transformer layers instead of long short-term memory (LSTM) layers for deliberation rescoring. In transformer layers, we generalize the "encoder-decoder" attention to attend to both encoded audio and first-pass text hypotheses. The output context vectors are then combined by a merger layer. Compared to LSTM-based deliberation, our best transformer deliberation achieves 7% relative word error rate improvements along with a 38% reduction in computation. We also compare against non-deliberation transformer rescoring, and find a 9% relative improvement.

