2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP) 2018
DOI: 10.1109/iscslp.2018.8706625

Emphasis Detection for Voice Dialogue Applications Using Multi-channel Convolutional Bidirectional Long Short-Term Memory Network

Cited by 8 publications (6 citation statements)
References 17 publications
“…We wish to investigate CNN-based automatic learning of word-level features from the same low-level acoustic contours. Given the previously observed speaker-dependence of the relative importances of the different prosodic attributes, we investigate a 3-channel CNN architecture where attribute-wise embeddings are computed with their own best filters and concatenated for the final representation [22]. The contour groups are F0 (4 contours), intensity (4 contours) and spectral shape including HNR and spectral band energies (7 contours) and each feature group is input to separate CNN filter bank as shown in Figure 1.…”
Section: Learning Word-level Features With CNN (mentioning)
confidence: 99%
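The three-channel arrangement described in this statement can be illustrated with a minimal sketch. It assumes a word is represented as a T x 15 matrix of frame-level contours ordered as F0 (4), intensity (4), spectral shape (7); the filter count, kernel width, random weights, and all function names are hypothetical and are not taken from the cited papers. Each group passes through its own filter bank (1-D convolution over time, ReLU, max-over-time pooling), and the three embeddings are concatenated.

```python
import random

# Contour counts per group, as quoted above: F0, intensity, spectral shape.
GROUP_SIZES = {"f0": 4, "intensity": 4, "spectral": 7}


def split_groups(frames):
    """Split a T x 15 matrix into one T x C matrix per contour group."""
    groups, start = {}, 0
    for name, size in GROUP_SIZES.items():
        groups[name] = [row[start:start + size] for row in frames]
        start += size
    return groups


def conv1d_maxpool(x, kernels):
    """Valid 1-D convolution over time, ReLU, then max-over-time pooling.

    x is a T x C matrix; kernels is a list of W x C weight matrices.
    Returns one pooled activation per kernel.
    """
    pooled = []
    for kernel in kernels:
        width = len(kernel)
        best = 0.0  # starting at 0 acts as the ReLU floor
        for t in range(len(x) - width + 1):
            act = sum(kernel[i][c] * x[t + i][c]
                      for i in range(width) for c in range(len(x[0])))
            best = max(best, act)
        pooled.append(best)
    return pooled


def word_embedding(frames, banks):
    """Concatenate the per-group embeddings into one word-level vector."""
    groups = split_groups(frames)
    embedding = []
    for name in GROUP_SIZES:
        embedding.extend(conv1d_maxpool(groups[name], banks[name]))
    return embedding


def random_bank(n_filters, width, channels, rng):
    """Untrained stand-in for a learned filter bank."""
    return [[[rng.uniform(-1, 1) for _ in range(channels)]
             for _ in range(width)] for _ in range(n_filters)]


rng = random.Random(0)
N_FILTERS, WIDTH = 8, 5  # hypothetical sizes, not from the source
banks = {name: random_bank(N_FILTERS, WIDTH, size, rng)
         for name, size in GROUP_SIZES.items()}
frames = [[rng.uniform(-1, 1) for _ in range(15)] for _ in range(40)]
embedding = word_embedding(frames, banks)
print(len(embedding))  # 24 = 3 groups x 8 filters
```

Keeping the banks separate means each attribute group is filtered only by kernels fitted to that group, which is the property the statement attributes to the multi-channel design.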
“…While our multi-channel CNN framework is similar to that of Zhang et al [22], we expand the search for architecture choices by considering the use of multiple kernel widths in each channel to capture the distinct time scales of acoustic variation. We start from the 4 kernels with widths [5,11,25,51] similar to that of the sentence parsing CNN architecture of Trang et al [31], which roughly cover sub-phone, phone, syllable and word, and possibly some context.…”
Section: CNN Training and Performance (mentioning)
confidence: 99%
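To make the time scales behind the kernel widths [5, 11, 25, 51] concrete, the sketch below converts each width to an approximate duration and applies one filter per width to a single contour. The 10 ms frame hop is an assumption (the statement does not give one), and the moving-average kernels are a stand-in for learned filters; the function names are hypothetical.

```python
KERNEL_WIDTHS = [5, 11, 25, 51]  # widths quoted in the statement above
HOP_MS = 10  # assumed frame hop in milliseconds; not stated in the source


def span_ms(width, hop_ms=HOP_MS):
    """Approximate duration covered by a kernel spanning `width` frames."""
    return width * hop_ms


def conv1d_same(signal, kernel):
    """Zero-padded 1-D cross-correlation; output length equals input length."""
    half = len(kernel) // 2  # all widths are odd, so padding is symmetric
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [sum(kernel[i] * padded[t + i] for i in range(len(kernel)))
            for t in range(len(signal))]


def multi_width_features(signal, widths=KERNEL_WIDTHS):
    """Max-over-time response of a moving-average kernel at each width."""
    feats = []
    for w in widths:
        kernel = [1.0 / w] * w  # stand-in for a learned kernel
        feats.append(max(conv1d_same(signal, kernel)))
    return feats


# A 10-frame bump (about 100 ms at the assumed hop) in a 70-frame contour.
signal = [0.0] * 30 + [1.0] * 10 + [0.0] * 30
spans = [span_ms(w) for w in KERNEL_WIDTHS]
feats = multi_width_features(signal)
print(spans)  # [50, 110, 250, 510]
```

Under the assumed hop, the four widths span roughly 50, 110, 250 and 510 ms. The short bump drives the narrowest kernel to its full response while the wider kernels dilute it, which illustrates why several widths are needed to capture variation at different time scales.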
“…Both local acoustic features and longer, more global contexts spanning several words and possibly different sentences across the utterance are important in the perception of prominence. Hence, architectures combining low-level feature aggregation with sequence models were realized with the same contour-learned features input to an LSTM classification layer [18,19].…”
Section: Introduction (mentioning)
confidence: 99%
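The combination of word-level features with a sequence model can be sketched as a forward pass of a standard LSTM cell over the words of an utterance, followed by a per-word sigmoid output. This is an illustration only: it is unidirectional, the weights are random and untrained, and the dimensions (24 features per word, 16 hidden units) and all names are hypothetical rather than taken from the cited work.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LSTMCell:
    """Standard LSTM cell: input (i), forget (f), output (o), candidate (c)."""

    GATES = "ifoc"

    def __init__(self, n_in, n_hidden, rng):
        self.n_hidden = n_hidden
        # One weight matrix per gate, applied to the concatenation [x; h].
        self.w = {gate: [[rng.uniform(-0.1, 0.1)
                          for _ in range(n_in + n_hidden)]
                         for _ in range(n_hidden)] for gate in self.GATES}
        self.b = {gate: [0.0] * n_hidden for gate in self.GATES}

    def step(self, x, h, c):
        z = list(x) + list(h)
        pre = {gate: [a + b for a, b in
                      zip(matvec(self.w[gate], z), self.b[gate])]
               for gate in self.GATES}
        i = [sigmoid(v) for v in pre["i"]]
        f = [sigmoid(v) for v in pre["f"]]
        o = [sigmoid(v) for v in pre["o"]]
        g = [math.tanh(v) for v in pre["c"]]
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new


def classify_words(word_vectors, cell, w_out, b_out):
    """Return one emphasis probability per word, in utterance order."""
    h = [0.0] * cell.n_hidden
    c = [0.0] * cell.n_hidden
    probs = []
    for x in word_vectors:
        h, c = cell.step(x, h, c)
        logit = sum(w * hj for w, hj in zip(w_out, h)) + b_out
        probs.append(sigmoid(logit))
    return probs


rng = random.Random(0)
N_FEATURES, N_HIDDEN, N_WORDS = 24, 16, 6  # hypothetical sizes
cell = LSTMCell(N_FEATURES, N_HIDDEN, rng)
w_out = [rng.uniform(-0.1, 0.1) for _ in range(N_HIDDEN)]
words = [[rng.uniform(-1, 1) for _ in range(N_FEATURES)]
         for _ in range(N_WORDS)]
probs = classify_words(words, cell, w_out, 0.0)
print(len(probs))  # 6, one probability per word
```

Because the hidden state carries over from word to word, each prediction can depend on the preceding words, which is the longer context the statement says matters for perceived prominence. A bidirectional variant would run a second cell over the reversed sequence and combine both hidden states before the output layer.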