Global transcriptional profiling between inbred parents and hybrids provides comprehensive insights into ear-length heterosis of maize (Zea mays)

Self-attention is a method of encoding sequences of vectors by relating these vectors to each other based on pairwise similarities. These models have recently shown promising results for modeling discrete sequences, but they are non-trivial to apply to acoustic modeling due to computational and modeling issues. In this paper, we apply self-attention to acoustic modeling, proposing several improvements to mitigate these issues: First, self-attention memory grows quadratically in the sequence length, which we address through a downsampling technique. Second, we find that previous approaches to incorporating position information into the model are unsuitable and explore other representations and hybrid models to this end. Third, to stress the importance of local context in the acoustic signal, we propose a Gaussian biasing approach that allows explicit control over the context range. Experiments find that our model approaches a strong baseline based on LSTMs with network-in-network connections while being much faster to compute. Besides speed, we find that interpretability is a strength of self-attentional acoustic models, and demonstrate that self-attention heads learn a linguistically plausible division of labor.
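
The Gaussian biasing idea mentioned in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a single attention head, identity query/key/value projections, and an additive bias of the form -(i - j)^2 / (2 * sigma^2) on the attention scores. The names `gaussian_bias`, `self_attention`, and `sigma` are hypothetical and chosen only for this example.

```python
import math

def gaussian_bias(n, sigma):
    """Additive score bias that penalizes distant positions (assumed form)."""
    return [[-((i - j) ** 2) / (2.0 * sigma ** 2) for j in range(n)]
            for i in range(n)]

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, sigma=None):
    """Single-head self-attention with identity projections (a simplification).

    x is a list of n vectors of dimension d. Returns the encoded vectors and
    the n-by-n attention weight matrix, whose size is what grows quadratically
    with the sequence length.
    """
    n, d = len(x), len(x[0])
    bias = gaussian_bias(n, sigma) if sigma is not None else [[0.0] * n for _ in range(n)]
    outputs, weights = [], []
    for i in range(n):
        # Scaled dot-product similarity between position i and every position j.
        scores = [sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d) + bias[i][j]
                  for j in range(n)]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * x[j][k] for j in range(n)) for k in range(d)])
    return outputs, weights

# Toy "acoustic frames"; real inputs would be feature vectors per time step.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
_, w_plain = self_attention(frames)
_, w_local = self_attention(frames, sigma=1.0)
```

With a small `sigma`, the weight that the first frame places on the last frame shrinks relative to the unbiased case, which is one way to read "explicit control over the context range".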


Improving Sequence-To-Sequence Speech Recognition Training with On-The-Fly Data Augmentation

Nguyen, Stüker, Niehues et al. 2020


Sequence-to-Sequence (S2S) models recently started to show state-of-the-art performance for automatic speech recognition (ASR). With these large and deep models, overfitting remains the largest problem, outweighing performance improvements that can be obtained from better architectures. One solution to the overfitting problem is increasing the amount of available training data and the variety exhibited by the training data with the help of data augmentation. In this paper we examine the influence of three data augmentation methods on the performance of two S2S model architectures. One of the data augmentation methods comes from the literature, while the other two are our own development: time perturbation in the frequency domain and sub-sequence sampling. Our experiments on Switchboard and Fisher data show state-of-the-art performance for S2S models that are trained solely on the speech training data and do not use additional text data. Many successful S2S models adopt log-mel frequency features as input. In the frequency domain, one major difficulty ...

The source code is available at https://github.com/thaisonngn/pynn (arXiv:1910.13296v1 [eess.AS]).


Findings of the IWSLT 2020 Evaluation Campaign

Ansari, Axelrod, Bach et al. 2020


The evaluation campaign of the International Conference on Spoken Language Translation (IWSLT 2020) featured six challenge tracks this year: (i) Simultaneous speech translation, (ii) Video speech translation, (iii) Offline speech translation, (iv) Conversational speech translation, (v) Open domain translation, and (vi) Non-native speech translation. A total of 30 teams participated in at least one of the tracks. This paper introduces each track's goal, data and evaluation metrics, and reports the results of the received submissions.


Self-Attentional Acoustic Models

Sperber, Niehues, Neubig et al. 2018

Preprint


Enhancing Backchannel Prediction Using Word Embeddings

Ruede, Müller, Stüker et al. 2017


Backchannel responses like "uh-huh", "yeah", "right" are used by the listener in a social dialog as a way to provide feedback to the speaker. In the context of human-computer interaction, these responses can be used by an artificial agent to build rapport in conversations with users. In the past, multiple approaches have been proposed to detect backchannel cues and to predict the most natural timing to place those backchannel utterances. Most of these are based on manually optimized fixed rules, which may fail to generalize. Many systems rely on the location and duration of pauses and pitch slopes of specific lengths. In earlier work, we proposed training artificial neural networks on acoustic features such as pitch and power, and also attempted to add word embeddings via word2vec. In this work, we refine this approach by evaluating different methods of adding timed word embeddings via word2vec. Comparing performance across various feature combinations, we show that adding linguistic features improves performance over a prediction system that uses only acoustic features.


Comparison of Decoding Strategies for CTC Acoustic Models

Zenkel, Sanabria, Metze et al. 2017


Connectionist Temporal Classification (CTC) has recently attracted a lot of interest as it offers an elegant approach to building acoustic models (AMs) for speech recognition. The CTC loss function maps an input sequence of observable feature vectors to an output sequence of symbols. Output symbols are conditionally independent of each other under CTC loss, so a language model (LM) can be incorporated conveniently during decoding, retaining the traditional separation of acoustic and linguistic components in ASR. For fixed vocabularies, Weighted Finite State Transducers provide a strong baseline for efficient integration of CTC AMs with n-gram LMs. Character-based neural LMs provide a straightforward solution for open vocabulary speech recognition and all-neural models, and can be decoded with beam search. Finally, sequence-to-sequence models can be used to translate a sequence of individual sounds into a word string. We compare the performance of these three approaches, and analyze their error patterns, which provides insightful guidance for future research and development in this important area.
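
For readers unfamiliar with how a CTC output sequence becomes a symbol string, the following sketch shows greedy (best-path) decoding. This is not one of the three strategies compared in the paper; it is the simplest baseline, with no language model, and is included only to make the frame-to-symbol mapping concrete. The function name `ctc_greedy_decode`, the symbol table, and the toy scores are all hypothetical.

```python
def ctc_greedy_decode(frame_scores, symbols, blank=0):
    """Greedy CTC decoding: best symbol per frame, merge repeats, drop blanks.

    frame_scores: one list of per-symbol scores for each input frame.
    symbols: maps a symbol index to its character; index `blank` is the CTC blank.
    """
    # Pick the highest-scoring symbol index independently for every frame.
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]
    decoded = []
    previous = None
    for index in best_path:
        # Emit a symbol only when it differs from the previous frame's symbol
        # and is not the blank; a blank between repeats keeps both copies.
        if index != previous and index != blank:
            decoded.append(symbols[index])
        previous = index
    return "".join(decoded)

# Toy example: index 0 is the blank symbol (written "_" here for readability).
symbols = ["_", "a", "b"]
frame_scores = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, merged)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.6, 0.3],    # a (new, because a blank intervened)
    [0.1, 0.2, 0.7],    # b
    [0.2, 0.1, 0.7],    # b (repeat, merged)
]
result = ctc_greedy_decode(frame_scores, symbols)
print(result)  # prints "aab"
```

Because each frame is decided independently, this decoder reflects the conditional independence assumption noted in the abstract; the WFST, neural LM, and sequence-to-sequence approaches each add linguistic context on top of it.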


Relative Positional Encoding for Speech Recognition and Direct Translation

Pham, Ha, Nguyen et al. 2020


Very Deep Self-Attention Networks for End-to-End Speech Recognition

Pham, Nguyen, Niehues et al. 2019

Preprint



Sebastian Stüker
