2012 IEEE Spoken Language Technology Workshop (SLT)
DOI: 10.1109/slt.2012.6424201

Exploiting loudness dynamics in stochastic models of turn-taking

Abstract: Stochastic turn-taking models have traditionally been implemented as N-grams, which condition predictions on recent binary-valued speech/non-speech contours. The current work re-implements this function using feed-forward neural networks, capable of accepting binary- as well as continuous-valued features; performance is shown to asymptotically approach that of the N-gram baseline as model complexity increases. The conditioning context is then extended to leverage loudness contours. Experiments indicate that t…
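
To make the modeling setup concrete, below is a minimal sketch of the N-gram-style stochastic turn-taking baseline described above (Python, with hypothetical names; this is not the paper's code). It estimates the probability of the next joint speech/non-speech state of a dyad from the preceding binary-valued frames.

```python
from collections import defaultdict

def train_turn_taking_ngram(chronogram, order=3):
    """Estimate P(next frame | previous `order` frames) from a
    speaker-attributed record: a list of (s1, s2) binary tuples,
    one per frame (1 = speaking, 0 = silent)."""
    counts = defaultdict(lambda: defaultdict(int))
    for t in range(order, len(chronogram)):
        context = tuple(chronogram[t - order:t])
        counts[context][chronogram[t]] += 1
    # Normalize raw counts into conditional probabilities.
    return {ctx: {state: n / sum(nxt.values()) for state, n in nxt.items()}
            for ctx, nxt in counts.items()}

# Toy dyad: speaker 1 talks, a short pause, then speaker 2 takes the turn.
frames = [(1, 0), (1, 0), (1, 0), (0, 0), (0, 1), (0, 1), (0, 1), (0, 0)] * 20
model = train_turn_taking_ngram(frames, order=3)
print(model[((1, 0), (1, 0), (1, 0))])  # distribution over the next joint state
```

A feed-forward network replacing this lookup table would take the same binary context as its input vector, which is what allows continuous-valued features such as loudness contours to be appended without changing the architecture.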

Cited by 5 publications (6 citation statements)
References 19 publications (20 reference statements)
“…This is very similar to our observations made for other non-telephony conversational corpora [15, 17]. The performance of baseline "4", the NN-based counterpart [6] to baseline "3", indicates no relative advantage for either, as expected. Baseline "5" provides an alternative NN-based model, which also uses the feature vector construction method in Figure 1.b, but the interlocutor portion of the vector contains the integer number of vocalizing interlocutors at instant t − s, a ternary variable for K = 3.…”
Section: Baseline Development (supporting)
confidence: 90%
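
A rough illustration of the feature-vector construction this statement contrasts (hypothetical names; Figure 1.b itself is not reproduced here): the target speaker's own binary history is concatenated with, per lag s, the integer count of vocalizing interlocutors at instant t − s, which for K = 3 participants is ternary-valued in {0, 1, 2}.

```python
import numpy as np

def build_feature_vector(chronogram, t, target, lags):
    """Concatenate the target speaker's binary speech states at each
    lag t - s with the count of vocalizing interlocutors at the same
    instants (ternary for K = 3 speakers: 0, 1, or 2)."""
    own = [chronogram[t - s][target] for s in lags]
    others = [sum(chronogram[t - s]) - chronogram[t - s][target] for s in lags]
    return np.array(own + others, dtype=float)

# chronogram[t] is a tuple of K = 3 binary speech states at frame t.
chrono = [(1, 0, 0), (1, 1, 0), (0, 1, 1), (0, 1, 1), (0, 0, 1)]
print(build_feature_vector(chrono, t=4, target=0, lags=[1, 2, 3]))
# -> [0. 0. 1. 2. 2. 1.]
```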
“…When f_BL(·) is implemented as a Jelinek-Mercer-smoothed n-gram model as described in [15], the cross-entropies for TRAINSET and TESTSET are 0.256 bits/100ms and 0.239 bits/100ms, respectively; these are shown as baseline "1" in Figure 2. Baseline "2" represents an NN-based MI implementation [6] as described in Subsection 2.3; as for all other NNs in this article, we used J = 32 hidden units (decided using TRAINSET), with one hundred iterations of scaled conjugate gradient (SCG) pre-training and one thousand iterations of SCG training. The performance of baselines "1" and "2" differs only negligibly.…”
Section: Baseline Development (mentioning)
confidence: 99%
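
For readers unfamiliar with the baseline components named here, the sketch below illustrates Jelinek-Mercer smoothing and the bits-per-frame cross-entropy metric (a generic illustration; the interpolation weights and count structure are assumptions, not the configuration of [15]).

```python
import math
from collections import defaultdict

def build_counts(sequence, max_order):
    """counts[k][context][symbol] = frequency, for k = 0 .. max_order."""
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(max_order + 1)]
    for t in range(max_order, len(sequence)):
        for k in range(max_order + 1):
            counts[k][tuple(sequence[t - k:t])][sequence[t]] += 1
    return counts

def jelinek_mercer_prob(context, symbol, counts, lambdas):
    """P_JM(w | h) = sum_k lambda_k * P_ML(w | last k symbols of h).
    The lambda_k must sum to one; tuning them is left to the caller."""
    p = 0.0
    for k, lam in enumerate(lambdas):
        hist = counts[k][tuple(context[len(context) - k:])]
        total = sum(hist.values())
        if total:
            p += lam * hist[symbol] / total
    return p

def cross_entropy_bits(sequence, counts, lambdas):
    """Average -log2 P per frame; with 100 ms frames this is bits/100ms."""
    order = len(lambdas) - 1
    nll = 0.0
    for t in range(order, len(sequence)):
        p = jelinek_mercer_prob(sequence[t - order:t], sequence[t], counts, lambdas)
        nll -= math.log2(max(p, 1e-12))
    return nll / (len(sequence) - order)
```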
“…As in (Laskowski, 2012), the methodology employed here relies on forming a probability distribution over the side-attributed speech activity in entire dyadic conversations. This eliminates a dependency on the specific definition of a turn; the resulting probability models attempt to account for all speech, effectively marginalizing out alternative definitions of what turns are and where they start and end.…”
Section: Stochastic Turn Taking (mentioning)
confidence: 99%
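
One concrete reading of "a probability distribution over the side-attributed speech activity in entire dyadic conversations" is a chain-rule product of per-frame conditional probabilities, so that no turn boundary is ever identified. The sketch below assumes a generic `predict(context, frame)` callable standing in for any trained model; the interface is hypothetical.

```python
import math

def conversation_log2_prob(chronogram, predict, order):
    """Chain-rule score of an entire dyadic conversation: accumulate
    log2 P(frame_t | preceding `order` frames) over every frame. No
    turn segmentation is computed, so alternative definitions of where
    turns start and end are effectively marginalized out."""
    logp = 0.0
    for t in range(order, len(chronogram)):
        p = predict(tuple(chronogram[t - order:t]), chronogram[t])
        logp += math.log2(max(p, 1e-12))
    return logp
```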
“…In their most commonly studied form (Jaffe et al., 1967; Brady, 1969), STT models condition their estimates on a history that consists exclusively of binary speech/non-speech variables; extensions to more complex characterizations of the past have been studied (Laskowski, 2012) but comprise the minority. In this binary-feature mode of operation, STT models ablate from conversations the overwhelming majority of the overt information contained in them, including topic, choice of words, language spoken, intonation, stress, voice quality, and voice itself, leaving only speaker-attributed chronograms (Chapple, 1949) of binary-valued behavior.…”
Section: Introduction (mentioning)
confidence: 99%
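
The chronogram representation mentioned in this statement can be sketched as follows (hypothetical names; the 100 ms frame size matches the cross-entropy units used elsewhere on this page): a binary grid with one row per frame and one column per speaker, built from speaker-attributed vocalization intervals.

```python
def chronogram(segments, num_speakers, duration_s, frame_s=0.1):
    """Speaker-attributed chronogram (Chapple, 1949): one row per frame,
    one binary column per speaker, 1 where that speaker vocalizes.
    `segments` is a list of (speaker_index, start_s, end_s) intervals."""
    n_frames = round(duration_s / frame_s)
    grid = [[0] * num_speakers for _ in range(n_frames)]
    for spk, start, end in segments:
        for t in range(round(start / frame_s), min(round(end / frame_s), n_frames)):
            grid[t][spk] = 1
    return grid

# Two speakers with a brief overlap around the turn exchange near 1.0 s.
for row in chronogram([(0, 0.0, 1.1), (1, 0.9, 2.0)], num_speakers=2, duration_s=2.0):
    print(row)
```

Everything except this grid (words, prosody, voice quality) is ablated, which is exactly the limitation the statement describes.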