Using Complexity-Identical Human- and Machine-Directed Utterances to Investigate Addressee Detection for Spoken Dialogue Systems

Akhtiamov, Oleg; Siegert, Ingo; Karpov, Alexey; Minker, Wolfgang

doi:10.3390/s20092740

Cited by 5 publications

(10 citation statements)

References 29 publications

“…First, with greater importance on true wake-word independence. In Akhtiamov et al (2020) the classification process was improved by employing an ensemble classifier, consisting of several classification tasks that are combined in a late fusion approach, which allows combining the strength of the different methods into one singular system. Second, with a heightened sense of privacy by changing a system to ignore information that is not directed to the device either by using features with a limited use to detect what has been said (Baumann and Siegert, 2020) or by extending a wakeword detection system by an acoustic feature classification to improve the security of such a system from false activations (Wang et al, 2020).…”

Section: Time Development (mentioning)

confidence: 99%

“…For using RBC as a test set, one of the best performances of 60.90% Unweighted Average Recall (UAR) was achieved, when using VACC data together with RBC data with an end-to-end (e2e) speech processing model (Akhtiamov et al, 2019). But using a more complex meta-model, that makes use of different models to combine different layers of information, gives a slightly better performance of 62.80 UAR (Akhtiamov et al, 2019, 2020).…”

Section: Studies Including Several Datasets (mentioning)

confidence: 99%

“…Additionally, some studies also investigated cross-corpus experiments using SVC and VACC (Akhtiamov et al, 2019) and in combination with RBC (Akhtiamov et al, 2020). The authors used all of these three datasets for testing and training.…”

Section: Studies Including Several Datasets (mentioning)

confidence: 99%

“…When only comparing VACC and SVC, which are designed with the same target classes, this rises to 172 similar features. In Akhtiamov et al (2020) where a meta-classifier approach is used, again the ComParE feature set is used, together with the ASR information and the spectrogram representation for the e2e approach.…”

Section: Studies Using Widely Known Feature Sets (mentioning)

confidence: 99%

“…Furthermore, three studies employed an e2e classification path, directly working on the acoustic representation (Akhtiamov et al, 2019, 2020).…”

Section: Studies Using Widely Known Feature Sets (mentioning)

confidence: 99%

Acoustic-Based Automatic Addressee Detection for Technical Systems: A Review

Siegert; Weißkirchen; Wendemuth

2022

Front. Comput. Sci.

Self Cite

Objective: Acoustic addressee detection is a challenge that arises in human group interactions, as well as in interactions with technical systems. The research domain is relatively new, and no structured review is available. Especially due to the recent growth of usage of voice assistants, this topic received increased attention. To allow a natural interaction on the same level as human interactions, many studies focused on the acoustic analyses of speech. The aim of this survey is to give an overview on the different studies and compare them in terms of utilized features, datasets, as well as classification architectures, which has so far been not conducted.

Methods: The survey followed the Preferred Reporting Items for Systematic reviews and Meta-Analysis (PRISMA) guidelines. We included all studies which were analyzing acoustic and/or acoustic characteristics of speech utterances to automatically detect the addressee. For each study, we describe the used dataset, feature set, classification architecture, performance, and other relevant findings.

Results: 1,581 studies were screened, of which 23 studies met the inclusion criteria. The majority of studies utilized German or English speech corpora. Twenty-six percent of the studies were tested on in-house datasets, where only limited information is available. Nearly 40% of the studies employed hand-crafted feature sets, the other studies mostly rely on Interspeech ComParE 2013 feature set or Log-FilterBank Energy and Log Energy of Short-Time Fourier Transform features. 12 out of 23 studies used deep-learning approaches, the other 11 studies used classical machine learning methods. Nine out of 23 studies furthermore employed a classifier fusion.

Conclusion: Speech-based automatic addressee detection is a relatively new research domain. Especially by using vast amounts of material or sophisticated models, device-directed speech is distinguished from non-device-directed speech. Furthermore, a clear distinction between in-house datasets and pre-existing ones can be drawn and a clear trend toward pre-defined larger feature sets (with partly used feature selection methods) is apparent.
