2012 IEEE Spoken Language Technology Workshop (SLT)
DOI: 10.1109/slt.2012.6424201

Exploiting loudness dynamics in stochastic models of turn-taking

Abstract: Stochastic turn-taking models have traditionally been implemented as N-grams, which condition predictions on recent binary-valued speech/non-speech contours. The current work re-implements this function using feed-forward neural networks, capable of accepting binary- as well as continuous-valued features; performance is shown to asymptotically approach that of the N-gram baseline as model complexity increases. The conditioning context is then extended to leverage loudness contours. Experiments indicate that t…
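
To make the modeling setup concrete, below is a minimal sketch of the N-gram-style stochastic turn-taking baseline described above (Python, with hypothetical names; this is not the paper's code). It estimates the probability of the next joint speech/non-speech state of a dyad from the preceding binary-valued frames.

```python
from collections import defaultdict

def train_turn_taking_ngram(chronogram, order=3):
    """Estimate P(next frame | previous `order` frames) from a
    speaker-attributed record: a list of (s1, s2) binary tuples,
    one per frame (1 = speaking, 0 = silent)."""
    counts = defaultdict(lambda: defaultdict(int))
    for t in range(order, len(chronogram)):
        context = tuple(chronogram[t - order:t])
        counts[context][chronogram[t]] += 1
    # Normalize raw counts into conditional probabilities.
    return {ctx: {state: n / sum(nxt.values()) for state, n in nxt.items()}
            for ctx, nxt in counts.items()}

# Toy dyad: speaker 1 talks, a short pause, then speaker 2 takes the turn.
frames = [(1, 0), (1, 0), (1, 0), (0, 0), (0, 1), (0, 1), (0, 1), (0, 0)] * 20
model = train_turn_taking_ngram(frames, order=3)
print(model[((1, 0), (1, 0), (1, 0))])  # distribution over the next joint state
```

A feed-forward network replacing this lookup table would take the same binary context as its input vector, which is what allows continuous-valued features such as loudness contours to be appended without changing the architecture.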

Cited by 5 publications (6 citation statements)
References 19 publications (20 reference statements)
“…This is very similar to our observations made for other non-telephony conversational corpora [15, 17]. The performance of baseline "4", the NN-based counterpart [6] to baseline "3", indicates no relative advantage for either, as expected. Baseline "5" provides an alternative NN-based model, which also uses the feature vector construction method in Figure 1.b, but the interlocutor portion of the vector contains the integer number of vocalizing interlocutors at instant t − s, a ternary variable for K = 3.…”
Section: Baseline Development (supporting)
confidence: 90%
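
A rough illustration of the feature-vector construction this statement contrasts (hypothetical names; Figure 1.b itself is not reproduced here): the target speaker's own binary history is concatenated with, per lag s, the integer count of vocalizing interlocutors at instant t − s, which for K = 3 participants is ternary-valued in {0, 1, 2}.

```python
import numpy as np

def build_feature_vector(chronogram, t, target, lags):
    """Concatenate the target speaker's binary speech states at each
    lag t - s with the count of vocalizing interlocutors at the same
    instants (ternary for K = 3 speakers: 0, 1, or 2)."""
    own = [chronogram[t - s][target] for s in lags]
    others = [sum(chronogram[t - s]) - chronogram[t - s][target] for s in lags]
    return np.array(own + others, dtype=float)

# chronogram[t] is a tuple of K = 3 binary speech states at frame t.
chrono = [(1, 0, 0), (1, 1, 0), (0, 1, 1), (0, 1, 1), (0, 0, 1)]
print(build_feature_vector(chrono, t=4, target=0, lags=[1, 2, 3]))
# -> [0. 0. 1. 2. 2. 1.]
```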
“…When f_BL(·) is implemented as a Jelinek-Mercer-smoothed n-gram model as described in [15], the cross-entropies for TRAINSET and TESTSET are 0.256 bits/100ms and 0.239 bits/100ms, respectively; these are shown as baseline "1" in Figure 2. Baseline "2" represents an NN-based MI implementation [6] as described in Subsection 2.3; as for all other NNs in this article, we used J = 32 hidden units (decided using TRAINSET), with one hundred iterations of scaled conjugate gradient (SCG) pre-training and one thousand iterations of SCG training. The performance of baselines "1" and "2" differs only negligibly.…”
Section: Baseline Development (mentioning)
confidence: 99%
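
For readers unfamiliar with the baseline components named here, the sketch below illustrates Jelinek-Mercer smoothing and the bits-per-frame cross-entropy metric (a generic illustration; the interpolation weights and count structure are assumptions, not the configuration of [15]).

```python
import math
from collections import defaultdict

def build_counts(sequence, max_order):
    """counts[k][context][symbol] = frequency, for k = 0 .. max_order."""
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(max_order + 1)]
    for t in range(max_order, len(sequence)):
        for k in range(max_order + 1):
            counts[k][tuple(sequence[t - k:t])][sequence[t]] += 1
    return counts

def jelinek_mercer_prob(context, symbol, counts, lambdas):
    """P_JM(w | h) = sum_k lambda_k * P_ML(w | last k symbols of h).
    The lambda_k must sum to one; tuning them is left to the caller."""
    p = 0.0
    for k, lam in enumerate(lambdas):
        hist = counts[k][tuple(context[len(context) - k:])]
        total = sum(hist.values())
        if total:
            p += lam * hist[symbol] / total
    return p

def cross_entropy_bits(sequence, counts, lambdas):
    """Average -log2 P per frame; with 100 ms frames this is bits/100ms."""
    order = len(lambdas) - 1
    nll = 0.0
    for t in range(order, len(sequence)):
        p = jelinek_mercer_prob(sequence[t - order:t], sequence[t], counts, lambdas)
        nll -= math.log2(max(p, 1e-12))
    return nll / (len(sequence) - order)
```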
“…As in (Laskowski, 2012), the methodology employed here relies on forming a probability distribution over the side-attributed speech activity in entire dyadic conversations. This eliminates a dependency on the specific definition of a turn; the resulting probability models attempt to account for all speech, effectively marginalizing out alternative definitions of what turns are and where they start and end.…”
Section: Stochastic Turn Taking (mentioning)
confidence: 99%
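
One concrete reading of "a probability distribution over the side-attributed speech activity in entire dyadic conversations" is a chain-rule product of per-frame conditional probabilities, so that no turn boundary is ever identified. The sketch below assumes a generic `predict(context, frame)` callable standing in for any trained model; the interface is hypothetical.

```python
import math

def conversation_log2_prob(chronogram, predict, order):
    """Chain-rule score of an entire dyadic conversation: accumulate
    log2 P(frame_t | preceding `order` frames) over every frame. No
    turn segmentation is computed, so alternative definitions of where
    turns start and end are effectively marginalized out."""
    logp = 0.0
    for t in range(order, len(chronogram)):
        p = predict(tuple(chronogram[t - order:t]), chronogram[t])
        logp += math.log2(max(p, 1e-12))
    return logp
```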
“…In their most commonly studied form (Jaffe et al., 1967; Brady, 1969), STT models condition their estimates on a history that consists exclusively of binary speech/non-speech variables; extensions to more complex characterizations of the past have been studied (Laskowski, 2012) but comprise the minority. In this binary-feature mode of operation, STT models ablate from conversations the overwhelming majority of the overt information contained in them, including topic, choice of words, language spoken, intonation, stress, voice quality, and voice itself, leaving only speaker-attributed chronograms (Chapple, 1949) of binary-valued behavior.…”
Section: Introduction (mentioning)
confidence: 99%
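
The chronogram representation mentioned in this statement can be sketched as follows (hypothetical names; the 100 ms frame size matches the cross-entropy units used elsewhere on this page): a binary grid with one row per frame and one column per speaker, built from speaker-attributed vocalization intervals.

```python
def chronogram(segments, num_speakers, duration_s, frame_s=0.1):
    """Speaker-attributed chronogram (Chapple, 1949): one row per frame,
    one binary column per speaker, 1 where that speaker vocalizes.
    `segments` is a list of (speaker_index, start_s, end_s) intervals."""
    n_frames = round(duration_s / frame_s)
    grid = [[0] * num_speakers for _ in range(n_frames)]
    for spk, start, end in segments:
        for t in range(round(start / frame_s), min(round(end / frame_s), n_frames)):
            grid[t][spk] = 1
    return grid

# Two speakers with a brief overlap around the turn exchange near 1.0 s.
for row in chronogram([(0, 0.0, 1.1), (1, 0.9, 2.0)], num_speakers=2, duration_s=2.0):
    print(row)
```

Everything except this grid (words, prosody, voice quality) is ablated, which is exactly the limitation the statement describes.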