2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8461478
Combining Acoustic Embeddings and Decoding Features for End-of-Utterance Detection in Real-Time Far-Field Speech Recognition Systems

Cited by 25 publications (24 citation statements) | References 12 publications
“…The combined feature-based EPD algorithm [39] consists of three parts to detect the EOU exactly by fusing multiple features. They are an acoustic LSTM trained on the acoustic features, the word LSTM trained on the 1-best ASR decoding hypothesis, and the DSFs composed of three types of pause durations, which are described as follows.…”
Section: B. EPD Based on Combining AFE, WE, and DSFs
Mentioning confidence: 99%
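The fusion described in the quote above — an acoustic LSTM embedding, a word LSTM embedding, and decoder pause-duration features (DSFs) combined to score end-of-utterance — can be sketched as a late concatenation feeding a single logistic output. The dimensions, weights, and the `eou_probability` helper below are illustrative assumptions, not the cited paper's actual architecture:

```python
import numpy as np

# Illustrative sketch only: dimensions and weights are assumptions,
# not taken from the cited EPD paper.
ACOUSTIC_DIM = 16  # size of the acoustic-LSTM embedding (assumed)
WORD_DIM = 8       # size of the word-LSTM embedding (assumed)
N_PAUSE = 3        # three pause-duration features (DSFs)

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def eou_probability(acoustic_emb, word_emb, pause_feats, w, b):
    """Fuse the three feature groups by concatenation and score
    end-of-utterance with a single logistic output layer."""
    fused = np.concatenate([acoustic_emb, word_emb, pause_feats])
    return sigmoid(fused @ w + b)

# Toy inputs standing in for LSTM outputs and decoder pause statistics.
acoustic_emb = rng.standard_normal(ACOUSTIC_DIM)
word_emb = rng.standard_normal(WORD_DIM)
pause_feats = np.array([0.4, 0.7, 0.2])  # e.g. trailing-silence durations (s)

w = rng.standard_normal(ACOUSTIC_DIM + WORD_DIM + N_PAUSE) * 0.1
b = 0.0

print(eou_probability(acoustic_emb, word_emb, pause_feats, w, b))
```

In practice the embeddings would be the final hidden states of the two LSTMs and the fusion layer would be trained jointly; the sketch only shows the combination step.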
See 1 more Smart Citation
“…The combined feature-based EPD algorithm [39] consists of three parts to detect the EOU exactly by fusing multiple features. They are an acoustic LSTM trained on the acoustic features, the word LSTM trained on the 1-best ASR decoding hypothesis, and the DSFs composed of three types of pause durations, which are described as follows.…”
Section: B Epd Based On Combining Afe We and Dsfsmentioning
confidence: 99%
“…To overcome this disadvantage, the expected pause duration is introduced as a stable feature for the online EPD task, since it is obtained by interpolating the pause durations within all active hypotheses [37], [38]. Furthermore, it was observed that the word embedding (WE), which is obtained from the word LSTM [39] trained on the 1-best ASR decoding hypothesis to detect the turn-taking word, can yield a significant performance improvement over acoustic feature embedding (AFE)-based EPD without an actual decoding process, whereas the combination of the AFE, WE, and expected pause durations achieves the state-of-the-art EPD performance. Also, [40] incorporated the EOU symbol into the output of a unified recurrent neural network (RNN) transducer-based ASR system.…”
Section: Introduction
Mentioning confidence: 99%
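The expected pause duration mentioned in the quote above can be illustrated as a posterior-weighted average of the pause durations over the active beam hypotheses. The beam structure, scores, and the `expected_pause_duration` helper below are made-up assumptions for illustration, not the cited papers' implementation:

```python
import math

def expected_pause_duration(hypotheses):
    """Interpolate pause durations across all active decoder hypotheses,
    weighting each by its normalized posterior derived from log scores.
    Subtracting the max score keeps the exponentials numerically stable."""
    max_score = max(h["score"] for h in hypotheses)
    weights = [math.exp(h["score"] - max_score) for h in hypotheses]
    total = sum(weights)
    return sum(w * h["pause"] for w, h in zip(weights, hypotheses)) / total

# Toy beam: log scores and per-hypothesis trailing pause durations (s).
beam = [
    {"score": -10.0, "pause": 0.60},
    {"score": -10.5, "pause": 0.20},
    {"score": -12.0, "pause": 0.05},
]
print(round(expected_pause_duration(beam), 3))  # → 0.418
```

Because the estimate pools over all hypotheses rather than trusting the single best one, it varies less when the 1-best hypothesis flips mid-utterance, which is what makes it a "stable" online feature.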
“…We conduct our experiments on 1,200 hours of live English training data collected from the Amazon Echo. Each utterance is hand-transcribed and begins with the same wake-up word, whose time alignment is provided by end-point detection [22,23,24,25]. As we have mentioned, while the training data is relatively clean and usually contains only device-directed speech, the test data is more challenging and mismatched with the training conditions: it may be noisy, may contain background speech 4 , or may even contain no device-directed speech at all.…”
Section: Experimental Settings
Mentioning confidence: 99%
“…Also, as we mentioned in Sec. 3.2, multi-task training can be conducted since we know the gold mask for each synthesized utterance. Given the imbalanced mask labels (25.0% for the “hard” set), the normalized values would be 1.000 and 5.000, respectively.…”
Section: Mask-Based Model
Mentioning confidence: 99%
“…To the best of our knowledge, this is the first time neural embeddings are being used for token confidence modeling. There has been prior work on using neural embeddings for other tasks such as endpoint detection [10] and device-directed utterance detection [11]. Our objective is to develop a similar neural embedding based framework to improve upon the baseline token confidence model.…”
Section: Introduction
Mentioning confidence: 99%