Emphasis Detection for Voice Dialogue Applications Using Multi-channel Convolutional Bidirectional Long Short-Term Memory Network

Zhang, Long; Jia, Jia; Meng, F.; Zhou, Shijie; Chen, Wei; Zhang, Cunjun; Li, Runnan

doi:10.1109/iscslp.2018.8706625

Cited by 8 publications

(6 citation statements)

References 17 publications


“…We wish to investigate CNN-based automatic learning of word-level features from the same low-level acoustic contours. Given the previously observed speaker-dependence of the relative importances of the different prosodic attributes, we investigate a 3-channel CNN architecture where attribute-wise embeddings are computed with their own best filters and concatenated for the final representation [22]. The contour groups are F0 (4 contours), intensity (4 contours) and spectral shape including HNR and spectral band energies (7 contours) and each feature group is input to separate CNN filter bank as shown in Figure 1.…”

Section: Learning Word-level Features With CNN (mentioning)

confidence: 99%

“…While our multi-channel CNN framework is similar to that of Zhang et al [22], we expand the search for architecture choices by considering the use of multiple kernel widths in each channel to capture the distinct time scales of acoustic variation. We start from the 4 kernels with widths [5,11,25,51] similar to that of the sentence parsing CNN architecture of Trang et al [31], which roughly cover sub-phone, phone, syllable and word, and possibly some context.…”

Section: CNN Training and Performance (mentioning)

confidence: 99%

“…We start from the 4 kernels with widths [5,11,25,51] similar to that of the sentence parsing CNN architecture of Trang et al [31], which roughly cover sub-phone, phone, syllable and word, and possibly some context. Given the fixed narrow kernel width of 3 frames used in [20,22], we add this to our candidates for testing. From the different combinations presented in Table 3, we observe that the syllable and word width kernel sizes (25,51) helps the performance while including other widths does not change it.…”

Section: CNN Training and Performance (mentioning)

confidence: 99%

“…With word position indicators provided in the input segment, they report an improvement of 1-3% points absolute over Rosenberg [11] on lexical stress and phrase boundary detection on the BURNC corpus, with speaker-independent scenarios being more challenging. Zhang et al [22] also use acoustic contours and MFCC features over 10 s segments as inputs to a CNN with fixed narrow kernel width of 3 frames, with syllable and word position indicators marked at the frame level. The CNN outputs go to a BLSTM classifier to obtain emphasis at frame level.…”

Section: Introduction (mentioning)

confidence: 99%
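The citation statements above describe a multi-channel design in which each contour group (F0, intensity, spectral shape) passes through its own convolutional filters of several kernel widths, and the per-channel embeddings are concatenated. A minimal sketch of that data flow in plain Python follows. It is not the cited authors' implementation: the function names are hypothetical, moving-average kernels stand in for learned filters, each contour is filtered independently, and ReLU with max-over-time pooling is an assumed aggregation step.

```python
from typing import Dict, List, Sequence

# One inner list per contour, each holding T frame values.
Contours = List[List[float]]


def conv1d_same(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Zero-padded 1-D correlation; output has the same length as the input."""
    half = len(kernel) // 2
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t + k - half
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


def channel_embedding(contours: Contours, widths: Sequence[int]) -> List[float]:
    """One value per (contour, kernel width): max over time of the ReLU response."""
    emb = []
    for contour in contours:
        for width in widths:
            # Assumption: a moving-average kernel replaces a trained filter.
            kernel = [1.0 / width] * width
            response = conv1d_same(contour, kernel)
            emb.append(max(max(r, 0.0) for r in response))
    return emb


def word_representation(groups: Dict[str, Contours], widths: Sequence[int]) -> List[float]:
    """Concatenate the embeddings of the three contour groups (channels)."""
    rep: List[float] = []
    for name in ("f0", "intensity", "spectral"):
        rep.extend(channel_embedding(groups[name], widths))
    return rep


if __name__ == "__main__":
    frames = 60  # illustrative segment length in frames
    groups = {
        "f0": [[float(t % 7) for t in range(frames)] for _ in range(4)],
        "intensity": [[1.0] * frames for _ in range(4)],
        "spectral": [[0.5] * frames for _ in range(7)],
    }
    # Kernel widths 25 and 51 are the syllable- and word-scale widths quoted above.
    rep = word_representation(groups, widths=(25, 51))
    print(len(rep))
```

With 4 + 4 + 7 contours and two kernel widths, the concatenated vector has one entry per contour and width. In a trained network each filter would typically span all contours in its group and the kernel weights would be learned rather than fixed.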


Prosodic event detection in children’s read speech

Sabu, Rao. 2021. Computer Speech & Language


“…Both local acoustic features and longer, more global contexts spanning several words and possibly different sentences across the utterance are important in the perception of prominence. Hence, architectures combining low-level feature aggregation with sequence models were realized with the same contour-learned features input to an LSTM classification layer [18,19].…”

Section: Introduction (mentioning)

confidence: 99%

Deep Learning For Prominence Detection In Children's Read Speech

Vaidya, Sabu, Rao. 2021. Preprint


The detection of perceived prominence in speech has attracted approaches ranging from the design of linguistic knowledge-based acoustic features to the automatic feature learning from suprasegmental attributes such as pitch and intensity contours. We present here, in contrast, a system that operates directly on segmented speech waveforms to learn features relevant to prominent word detection for children's oral fluency assessment. The chosen CRNN (convolutional recurrent neural network) framework, incorporating both word-level features and sequence information, is found to benefit from the perceptually motivated SincNet filters as the first convolutional layer. We further explore the benefits of the linguistic association between the prosodic events of phrase boundary and prominence with different multi-task architectures. Matching the previously reported performance on the same dataset of a random forest ensemble predictor trained on carefully chosen hand-crafted acoustic features, we evaluate further the possibly complementary information from hand-crafted acoustic and pre-trained lexical features.


Emotional Design for Children’s Electronic Picture Book

Jia et al. 2019. Human-Computer Interaction. Perspectives on Design
