2017
DOI: 10.48550/arxiv.1703.07754
Preprint
Direct Acoustics-to-Word Models for English Conversational Speech Recognition

Cited by 20 publications (40 citation statements)
References 0 publications
“…The former approach is considered more pertinent to tasks such as isolated word recognition, classification, and detection, while the latter is considered more suited to sentence-level classification and large-vocabulary continuous speech recognition (LVCSR). Nevertheless, recent advances in speech recognition and natural language processing show that direct modeling of words is feasible even for LVCSR [7][8][9].…”
Section: Introduction
confidence: 99%
“…For the TextCNN module, all words in a document are first embedded as 300-dimensional vectors using pretrained GloVe embeddings [19] and then fed to the TextCNN. The network has three parallel convolutional layers with the same number of filters (256) but different kernel sizes (2, 3, 4), and all convolutional layers use ReLU as the activation function. The outputs of the three parallel convolutional layers are concatenated and pooled into a 768-dimensional vector that is used for topic classification.…”
Section: TTC Module
confidence: 99%
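The dimensions in the statement above can be checked with a minimal NumPy sketch of one TextCNN forward pass. All sizes and weights here are toy stand-ins (random values in place of GloVe vectors and trained filters); only the shapes follow the description: three parallel branches with kernel sizes 2, 3, 4 and 256 filters each, concatenated into one feature vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions from the quoted description: 300-dim (GloVe-sized) word
# embeddings, three parallel conv branches, 256 filters per branch.
EMB_DIM, N_FILTERS, KERNELS = 300, 256, (2, 3, 4)

def conv1d_relu_maxpool(x, W):
    """One TextCNN branch: valid 1-D convolution over time,
    ReLU, then max-pooling over the time axis.
    x: (seq_len, emb_dim); W: (kernel, emb_dim, n_filters)."""
    k = W.shape[0]
    # Unfold the sequence into flattened sliding windows of length k.
    windows = np.stack([x[i:i + k].reshape(-1) for i in range(len(x) - k + 1)])
    out = windows @ W.reshape(-1, W.shape[-1])  # (time, n_filters)
    return np.maximum(out, 0).max(axis=0)       # (n_filters,) after pooling

seq_len = 20                                    # toy document length
x = rng.normal(size=(seq_len, EMB_DIM))         # stand-in for GloVe vectors
weights = [rng.normal(size=(k, EMB_DIM, N_FILTERS)) * 0.01 for k in KERNELS]

# Concatenating the three pooled branch outputs gives 3 * 256 = 768 dims.
feat = np.concatenate([conv1d_relu_maxpool(x, W) for W in weights])
print(feat.shape)  # (768,)
```

The 768-dimensional vector mentioned in the quote is thus simply the concatenation of the three 256-dimensional pooled branch outputs.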
“…Conventional TC systems for spoken documents are usually designed as pipelines, which first transform speech into text through an automatic speech recognition (ASR) module and then perform topic classification on the recognized text through a text topic classification (TTC) module. For ASR, end-to-end models have become popular alternatives to conventional deep neural network-hidden Markov model (DNN-HMM) hybrids because of their simpler model architecture and comparable or even better performance [1][2][3][4][5][6]. One of the most representative end-to-end models is the connectionist temporal classification (CTC)-based framework [6].…”
Section: Introduction
confidence: 99%
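The CTC framework mentioned above maps a per-frame label sequence to an output sequence by merging consecutive repeats and removing blanks. A minimal sketch of that collapse rule (the blank symbol and example labels are illustrative assumptions):

```python
from itertools import groupby

BLANK = "_"  # stand-in for the CTC blank symbol

def ctc_collapse(frames):
    """Apply the CTC decoding rule to a frame-level label sequence:
    first merge runs of identical consecutive labels, then drop blanks."""
    return [label for label, _ in groupby(frames) if label != BLANK]

# A frame-level hypothesis collapses to the label sequence it encodes.
print(ctc_collapse(["_", "hi", "hi", "_", "_", "there", "there", "_"]))
# ['hi', 'there']
```

In a direct acoustics-to-word model, the labels here are whole words rather than phones or characters, which is what lets the model skip a separate pronunciation lexicon.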
“…Prior work has also attempted direct-to-word speech recognition [2,21,31]. These approaches require massive data sets to work well [31] and do not have adaptable lexicons.…”
Section: Related Work
confidence: 99%