Improving Speech-Based End-of-Turn Detection Via Cross-Modal Representation Learning with Punctuated Text Data

Masumura, Ryo; Ihori, Mana; Tanaka, Tomohiro; Ando, Atsushi; Ishii, Ryo; Oba, Tadamichi; Higashinaka, Ryuichiro

doi:10.1109/asru46091.2019.9003816

Cited by 5 publications

(10 citation statements)

References 26 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…We used high-level abstracted features extracted from acoustic, linguistic, and visual modalities. We plan to use other interpretable features, such as prosody (Ferrer et al, 2002;Holler and Kendrick, 2015;Hömke et al, 2017;Holler et al, 2018;Masumura et al, 2018Masumura et al, , 2019Roddy et al, 2018) and gaze behavior (Chen and Harper, 2009;Kawahara et al, 2012;Jokinen et al, 2013;Ishii et al, 2015aIshii et al, , 2016a and to implement more complex predictive models (Masumura et al, 2018(Masumura et al, , 2019Roddy et al, 2018;Ward et al, 2018) that take into account temporal dependencies. Hara et al (2018) proposed a predictive model that can predict backchannels and fillers in addition to turn-changing using multi-task learning.…”

Section: Future Workmentioning

confidence: 99%

“…As a result of previous research on conversation turns and behaviors, many studies have developed models for predicting actual turn-changing, i.e., whether turn-changing or turn-keeping will take place, on the basis of acoustic features (Ferrer et al, 2002 ; Schlangen, 2006 ; Chen and Harper, 2009 ; de Kok and Heylen, 2009 ; Huang et al, 2011 ; Laskowski et al, 2011 ; Eyben et al, 2013 ; Jokinen et al, 2013 ; Hara et al, 2018 ; Lala et al, 2018 ; Masumura et al, 2018 , 2019 ; Roddy et al, 2018 ; Ward et al, 2018 ). They have used representative acoustic features from the speaker's speech such as log-mel and mel-frequency cepstral coefficients (MFCCs) as feature values.…”

Section: Related Workmentioning

confidence: 99%

“…The field of human-computer interaction has long been dedicated to computational modeling of turn-changing. Many studies have focused on developing actual turn-changing (i.e., next speaker or end-of-turn) models that can predict whether turn-keeping or turn-changing will happen using participants' verbal and non-verbal behaviors (Ferrer et al, 2002 ; Schlangen, 2006 ; Chen and Harper, 2009 ; de Kok and Heylen, 2009 ; Laskowski et al, 2011 ; Kawahara et al, 2012 ; Jokinen et al, 2013 ; Holler and Kendrick, 2015 ; Ishii et al, 2015a , b , 2016a , b , 2017 , 2019 ; Lammertink et al, 2015 ; Levinson, 2016 ; Hömke et al, 2017 ; Hara et al, 2018 ; Holler et al, 2018 ; Lala et al, 2018 ; Masumura et al, 2018 , 2019 ; Roddy et al, 2018 ; Ward et al, 2018 ). These studies predicted turn-changing on the basis of verbal and non-verbal behaviors.…”

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

Trimodal prediction of speaking and listening willingness to help improve turn-changing modeling

Ishii

Ren²,

Muszyński³

et al. 2022

Front. Psychol.

Self Cite

View full text Add to dashboard Cite

Participants in a conversation must carefully monitor the turn-management (speaking and listening) willingness of other conversational partners and adjust their turn-changing behaviors accordingly to have smooth conversation. Many studies have focused on developing actual turn-changing (i.e., next speaker or end-of-turn) models that can predict whether turn-keeping or turn-changing will occur. Participants' verbal and non-verbal behaviors have been used as input features for predictive models. To the best of our knowledge, these studies only model the relationship between participant behavior and turn-changing. Thus, there is no model that takes into account participants' willingness to acquire a turn (turn-management willingness). In this paper, we address the challenge of building such models to predict the willingness of both speakers and listeners. Firstly, we find that dissonance exists between willingness and actual turn-changing. Secondly, we propose predictive models that are based on trimodal inputs, including acoustic, linguistic, and visual cues distilled from conversations. Additionally, we study the impact of modeling willingness to help improve the task of turn-changing prediction. To do so, we introduce a dyadic conversation corpus with annotated scores of speaker/listener turn-management willingness. Our results show that using all three modalities (i.e., acoustic, linguistic, and visual cues) of the speaker and listener is critically important for predicting turn-management willingness. Furthermore, explicitly adding willingness as a prediction task improves the performance of turn-changing prediction. Moreover, turn-management willingness prediction becomes more accurate when this joint prediction of turn-management willingness and turn-changing is performed by using multi-task learning techniques.

show abstract

Section: Future Workmentioning

confidence: 99%

Section: Related Workmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

Trimodal prediction of speaking and listening willingness to help improve turn-changing modeling

Ishii

Ren²,

Muszyński³

et al. 2022

Front. Psychol.

Self Cite

View full text Add to dashboard Cite

show abstract

“…With such knowledge, many studies have developed models for predicting actual turn-changing, i.e., whether turn-changing or turn-keeping will take place, on the basis of acoustic features [3, 6, 10, 12, 18, 26, 34, 36ś38, 43, 47, 50], linguistic features [34,37,38,43], and visual features, such as overall physical motion [3,6,8,43] near the end of a speaker's utterances or during multiple utterances. Moreover, some research has focused on detailed non-verbal behaviors such as eye-gaze behavior [3,6,18,20,24,26], head movement [18,21,22], mouth movement [23], and respiration [20,25].…”

Section: Related Work 21 Turn-changing Prediction Technologymentioning

confidence: 99%

“…We used automatically high-level abstracted features extracted from acoustic, linguistic, and visual modalities. We plan to use other interpretable features, such as prosody [10,15,16,19,37,38,43] and gaze behavior [3,20,24,26,30], and implement more complex prediction models [37,38,43,50] that take into account temporal dependencies.…”

Section: Future Workmentioning

confidence: 99%

Can Prediction of Turn-management Willingness Improve Turn-changing Modeling?

Ishii

Ren

Muszyński

et al. 2020

Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents

Self Cite

View full text Add to dashboard Cite

For smooth conversation, participants must carefully monitor the turn-management (a.k.a. speaking and listening) willingness of other conversational partners and adjust turn-changing behaviors accordingly. Many studies have focused on predicting the actual moments of speaker changes (a.k.a. turn-changing), but to the best of our knowledge, none of them explicitly modeled the turn-management willingness from both speakers and listeners in dyad interactions. We address the problem of building models for predicting this willingness of both. Our models are based on trimodal inputs, including acoustic, linguistic, and visual cues from conversations. We also study the impact of modeling willingness to help improve the task of turn-changing prediction. We introduce a dyadic conversation corpus with annotated scores of speaker/listener turn-management willingness. Our results show that using all of three modalities of speaker and listener is important for predicting turn-management willingness. Furthermore, explicitly adding willingness as a prediction task improves the * Both authors contributed equally to this research.

show abstract

Training Spoken Language Understanding Systems with Non-Parallel Speech and Text

Sarı

Thomas

Hasegawa‐Johnson

2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

View full text Add to dashboard Cite

Improving Speech-Based End-of-Turn Detection Via Cross-Modal Representation Learning with Punctuated Text Data

Cited by 5 publications

References 26 publications

Trimodal prediction of speaking and listening willingness to help improve turn-changing modeling

Trimodal prediction of speaking and listening willingness to help improve turn-changing modeling

Can Prediction of Turn-management Willingness Improve Turn-changing Modeling?

Training Spoken Language Understanding Systems with Non-Parallel Speech and Text

Contact Info

Product

Resources

About