2017
DOI: 10.48550/arxiv.1703.07754
Preprint
Direct Acoustics-to-Word Models for English Conversational Speech Recognition

Cited by 20 publications (40 citation statements)
References 0 publications
“…The former approach is considered more pertinent to tasks such as isolated word recognition, classification, and detection, while the latter is considered more suited to sentence-level classification and large-vocabulary continuous speech recognition (LVCSR). Nevertheless, recent advances in speech recognition and natural language processing show that direct modeling of words is feasible even for LVCSR [7][8][9].…”
Section: Introduction
confidence: 99%
“…For the TextCNN module, all words in a document are first embedded as 300-dimensional vectors using pretrained GloVe embeddings [19] and then fed to the TextCNN. The network has three parallel convolutional layers with the same number of filters (256) but different kernel sizes (2, 3, 4), and all convolutional layers use ReLU as the activation function. The outputs of the three parallel convolutional layers are concatenated and pooled into a 768-dimensional vector that is used for topic classification.…”
Section: TTC Module
confidence: 99%
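The dimensions in the statement above can be checked with a minimal NumPy sketch of one TextCNN forward pass. All sizes and weights here are toy stand-ins (random values in place of GloVe vectors and trained filters); only the shapes follow the description: three parallel branches with kernel sizes 2, 3, 4 and 256 filters each, concatenated into one feature vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions from the quoted description: 300-dim (GloVe-sized) word
# embeddings, three parallel conv branches, 256 filters per branch.
EMB_DIM, N_FILTERS, KERNELS = 300, 256, (2, 3, 4)

def conv1d_relu_maxpool(x, W):
    """One TextCNN branch: valid 1-D convolution over time,
    ReLU, then max-pooling over the time axis.
    x: (seq_len, emb_dim); W: (kernel, emb_dim, n_filters)."""
    k = W.shape[0]
    # Unfold the sequence into flattened sliding windows of length k.
    windows = np.stack([x[i:i + k].reshape(-1) for i in range(len(x) - k + 1)])
    out = windows @ W.reshape(-1, W.shape[-1])  # (time, n_filters)
    return np.maximum(out, 0).max(axis=0)       # (n_filters,) after pooling

seq_len = 20                                    # toy document length
x = rng.normal(size=(seq_len, EMB_DIM))         # stand-in for GloVe vectors
weights = [rng.normal(size=(k, EMB_DIM, N_FILTERS)) * 0.01 for k in KERNELS]

# Concatenating the three pooled branch outputs gives 3 * 256 = 768 dims.
feat = np.concatenate([conv1d_relu_maxpool(x, W) for W in weights])
print(feat.shape)  # (768,)
```

The 768-dimensional vector mentioned in the quote is thus simply the concatenation of the three 256-dimensional pooled branch outputs.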
“…Conventional TC systems for spoken documents are usually designed as pipelines, which first transform speech into text through an automatic speech recognition (ASR) module and then perform topic classification on the recognized text through a text topic classification (TTC) module. For ASR, end-to-end models have become popular alternatives to conventional deep neural network-hidden Markov model (DNN-HMM) hybrids because of their simpler model architecture and comparable or even better performance [1][2][3][4][5][6]. One of the most representative end-to-end models is the connectionist temporal classification (CTC)-based framework [6].…”
Section: Introduction
confidence: 99%
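The CTC framework mentioned above maps a per-frame label sequence to an output sequence by merging consecutive repeats and removing blanks. A minimal sketch of that collapse rule (the blank symbol and example labels are illustrative assumptions):

```python
from itertools import groupby

BLANK = "_"  # stand-in for the CTC blank symbol

def ctc_collapse(frames):
    """Apply the CTC decoding rule to a frame-level label sequence:
    first merge runs of identical consecutive labels, then drop blanks."""
    return [label for label, _ in groupby(frames) if label != BLANK]

# A frame-level hypothesis collapses to the label sequence it encodes.
print(ctc_collapse(["_", "hi", "hi", "_", "_", "there", "there", "_"]))
# ['hi', 'there']
```

In a direct acoustics-to-word model, the labels here are whole words rather than phones or characters, which is what lets the model skip a separate pronunciation lexicon.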
“…Prior work has also attempted direct-to-word speech recognition [2,21,31]. These approaches require massive data sets to work well [31] and do not have adaptable lexicons.…”
Section: Related Work
confidence: 99%