A finite-state turn-taking model for spoken dialog systems

Raux, Antoine; Eskénazi, Maxine

doi:10.3115/1620754.1620846

Cited by 84 publications

(75 citation statements)

References 16 publications


“…More recent work on engagement with virtual agents uses more elaborate turn-taking models and supports multiparty conversation (Bohus & Horvitz, 2010). Research in spoken dialog systems also attempts to control the timing of turn-taking over the single modality of speech (Raux & Eskenazi, 2009). Although some results on cue usage in unembodied systems can generalize to robots, the timing of controlling actions on embodied machines differs substantially from that of virtual systems.…”

Section: Related Work (mentioning)

confidence: 99%


Timing in Multimodal Turn-Taking Interactions: Control and Analysis Using Timed Petri Nets

Chao

Thomaz

2012

JHRI


Turn-taking interactions with humans are multimodal and reciprocal in nature. In addition, the timing of actions is of great importance, as it influences both social and task strategies. To enable the precise control and analysis of timed discrete events for a robot, we develop a system for multimodal collaboration based on a timed Petri net (TPN) representation. We also argue for action interruptions in reciprocal interaction and describe its implementation within our system. Using the system, our autonomously operating humanoid robot Simon collaborates with humans through both speech and physical action to solve the Towers of Hanoi, during which the human and the robot take turns manipulating objects in a shared physical workspace. We hypothesize that action interruptions have a positive impact on turn-taking and evaluate this in the Towers of Hanoi domain through two experimental methods. One is a between-groups user study with 16 participants. The other is a simulation experiment using 200 simulated users of varying speed, initiative, compliance, and correctness. In these experiments, action interruptions are either present or absent in the system. Our collective results show that action interruptions lead to increased task efficiency through increased user initiative, improved interaction balance, and higher sense of fluency. In arriving at these results, we demonstrate how these evaluation methods can be highly complementary in the analysis of interaction dynamics.



“…The work in (Raux & Eskenazi, 2009) and (Nakano et al, 2005) are examples of dialogue systems in which speech interruptions in particular are supported. Interruption has also been addressed more indirectly through an approach of behavior switching (Kanda, Ishiguro, Imai, & Ono, 2004).…”

Section: Action Atomicity In Reciprocal Interaction (mentioning)

confidence: 99%

Timing in Multimodal Turn-Taking Interactions: Control and Analysis Using Timed Petri Nets

Chao

Thomaz

2012

JHRI


“…Experiments show that contours of loudness, approximated by normalized per-frame log-energy, should be concatenated with speech activity trajectories in feature space rather than in model space (as in [6]), in order to give models the opportunity to leverage cross-stream correlations; it appears that the most relevant information is found in audio frames which are both speech and very quiet. The absolute reduction in average cross entropy obtained using this approach, on unseen data consisting of 200 telephone conversations, is 0.031 bits per 100 ms frame of audio, a large improvement when compared to past research [7,8].…”

Section: What Is the Likely Impact Of The Observed Average Cross En… (mentioning)

confidence: 99%

“…Although studied for many decades [2,3,4,5,6,7,8], these models continue to exhibit an important limitation: their implementation as N-grams circumscribes their direct applicability to only discrete-valued representations of conditioning context. This limitation has made it hard to study the impact of quantities which are continuous-valued (e.g., loudness or pitch), independently of higher-level linguistic landmarks or assumptions.…”

Section: Introduction (mentioning)

confidence: 99%

Exploiting loudness dynamics in stochastic models of turn-taking

Laskowski

2012

2012 IEEE Spoken Language Technology Workshop (SLT)


Stochastic turn-taking models have traditionally been implemented as N-grams, which condition predictions on recent binary-valued speech/non-speech contours. The current work re-implements this function using feed-forward neural networks, capable of accepting binary- as well as continuous-valued features; performance is shown to asymptotically approach that of the N-gram baseline as model complexity increases. The conditioning context is then extended to leverage loudness contours. Experiments indicate that the additional sensitivity to loudness considerably decreases average cross entropy rates on unseen data, by 0.03 bits per framing interval of 100 ms. This reduction is shown to make loudness-sensitive conversants capable of better predictions, with attention memory requirements at least 5 times smaller and responsiveness latency at least 10 times shorter than the loudness-insensitive baseline.
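The metric named in this abstract, average cross entropy in bits per 100 ms frame, can be illustrated with a small sketch. The function name, the toy frame sequence, and the predicted probabilities are illustrative assumptions and are not taken from the paper; the sketch only shows how such an average could be computed for a single binary speech/non-speech stream.

```python
import math

def avg_cross_entropy_bits(frames, p_speech):
    """Average negative log2-likelihood per frame.

    frames:   observed labels, 1 = speech, 0 = non-speech (one per frame)
    p_speech: predicted probability of speech for each frame
    """
    if len(frames) != len(p_speech) or not frames:
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for label, p in zip(frames, p_speech):
        # Probability the model assigned to what actually happened.
        q = p if label == 1 else 1.0 - p
        total -= math.log2(q)
    return total / len(frames)

# Hypothetical toy data: four 100 ms frames and two hypothetical predictors.
frames = [1, 1, 0, 0]
uninformed = [0.5, 0.5, 0.5, 0.5]
informed = [0.75, 0.75, 0.25, 0.25]

print(avg_cross_entropy_bits(frames, uninformed))
print(avg_cross_entropy_bits(frames, informed))
```

A lower value means the predictor assigns higher probability to the observed speech activity, so a reduction of a few hundredths of a bit per frame is measured on this scale.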


“…More recently, however, work on incremental systems has shown that processing smaller 'chunks' of user input can improve the user experience by providing faster responses and allow more flexibility in turn-taking (Purver and Otsuka, 2003; Skantze and Hjalmarsson, 2010; Raux and Eskenazi, 2009; Dethlefs et al, 2012b). Incrementality in spoken dialogue systems enables the system designer to model several dialogue phenomena that play a vital role in human conversation (Levelt, 1989), but have so far been absent from most systems.…”

Section: Introduction (mentioning)

confidence: 99%

Information density and overlap in spoken dialogue

Dethlefs

Hastie

Cuayáhuitl

et al.

2016

Computer Speech & Language


Incremental dialogue systems are often perceived as more responsive and natural because they are able to address phenomena of turn-taking and overlapping speech, such as backchannels or barge-ins. Previous work in this area has often identified distinctive prosodic features, or features relating to syntactic or semantic completeness, as marking appropriate places of turn-taking. In a separate strand of work, psycholinguistic studies have established a connection between information density and prominence in language: the less expected a linguistic unit is in a particular context, the more likely it is to be linguistically marked. This has been observed across linguistic levels, including the prosodic, which plays an important role in predicting overlapping speech. In this article, we explore the hypothesis that information density (ID) also plays a role in turn-taking. Specifically, we aim to show that humans are sensitive to the peaks and troughs of information density in speech, and that overlapping speech at ID troughs is perceived as more acceptable than overlaps at ID peaks. To test our hypothesis, we collect human ratings for three models of generating overlapping speech based on features of: (1) prosody and semantic or syntactic completeness, (2) information density, and (3) both types of information. Results show that over 50% of users preferred the version using both types of features, followed by a preference for information density features alone. This indicates a clear human sensitivity to the effects of information density in spoken language and provides a strong motivation to adopt this metric for the design, development and evaluation of turn-taking modules in spoken and incremental dialogue systems.
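One common way to make "how expected a unit is in context" concrete is surprisal, the negative log probability of a word given its context. The sketch below is a minimal illustration under stated assumptions: the abstract does not say how the article estimates information density, so the bigram model with add-one smoothing, the toy corpus, and the function names `bigram_surprisal` and `troughs` are all hypothetical.

```python
import math
from collections import Counter

def bigram_surprisal(tokens, corpus):
    """Return (word, surprisal in bits) for each word after the first.

    Surprisal is -log2 P(word | previous word), estimated from `corpus`
    with add-one smoothing (an assumption made for this sketch).
    """
    vocab = set(corpus) | set(tokens)
    contexts = Counter(corpus[:-1])            # words that have a successor
    bigrams = Counter(zip(corpus, corpus[1:]))
    scored = []
    for prev, word in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))
        scored.append((word, -math.log2(p)))
    return scored

def troughs(scored):
    """Indices whose surprisal is strictly lower than both neighbours."""
    return [
        i for i in range(1, len(scored) - 1)
        if scored[i][1] < scored[i - 1][1] and scored[i][1] < scored[i + 1][1]
    ]

# Hypothetical toy corpus and utterance.
corpus = "i want to go to the station i want to go home".split()
utterance = "i want to go home".split()

scored = bigram_surprisal(utterance, corpus)
for word, bits in scored:
    print(f"{word}: {bits:.2f} bits")
print(troughs(scored))
```

Under this reading, positions returned by `troughs` would be candidate low-density points; whether they correspond to the article's notion of acceptable overlap locations is not something this sketch can establish.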
