2010
DOI: 10.1007/978-3-642-16202-2_6
Expansion of WFST-Based Dialog Management for Handling Multiple ASR Hypotheses

Abstract: We proposed a weighted finite-state transducer-based dialog manager (WFSTDM), which is a platform for expandable and adaptable dialog systems. In this platform, all rules and/or models for dialog management (DM) are expressed in WFST form, and the WFSTs are used to accomplish various tasks via multiple modalities. With this framework, we constructed a statistical dialog system using the user concept and system action tags acquired from an annotated corpus of human-to-human spoken dialogs as input an…
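The abstract describes dialog rules expressed as weighted transducers that map user concept tags to system action tags. The following is a minimal sketch of that idea, assuming a simplified dialog WFST whose arcs carry (concept tag : action tag, weight) labels; the tag names, topology, and weights below are invented for illustration and are not taken from the paper.

```python
from collections import defaultdict

class DialogWFST:
    """Toy weighted transducer: arcs map a user concept tag to a system
    action tag with a cost (lower is better)."""

    def __init__(self):
        # arcs[state][concept_tag] -> list of (action_tag, next_state, weight)
        self.arcs = defaultdict(lambda: defaultdict(list))

    def add_arc(self, state, concept, action, next_state, weight):
        self.arcs[state][concept].append((action, next_state, weight))

    def step(self, state, concept):
        """Pick the cheapest arc matching the observed concept tag."""
        candidates = self.arcs[state].get(concept, [])
        if not candidates:
            return None, state            # no matching rule: stay in state
        action, next_state, _ = min(candidates, key=lambda a: a[2])
        return action, next_state

# Hypothetical restaurant-domain fragment.
dm = DialogWFST()
dm.add_arc("start", "request(restaurant)", "ask(area)", "ask_area", 0.5)
dm.add_arc("ask_area", "inform(area)", "offer(restaurant)", "offer", 0.3)
dm.add_arc("ask_area", "inform(price)", "ask(area)", "ask_area", 0.8)

state = "start"
for concept in ["request(restaurant)", "inform(area)"]:
    action, state = dm.step(state, concept)
    print(concept, "->", action)
```

In practice the transducer would be learned or compiled from the annotated corpus mentioned in the abstract rather than hand-written as above.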

Cited by 1 publication (2 citation statements)
References 12 publications (9 reference statements)
“…In [12], a dialog system is described that transforms textual user utterances into response sentences using weighted FSTs, with the goal of running a full back-and-forth dialog with the user. It was extended by [13] to accept n-best hypotheses from a triphone acoustic model, combined with an additional 3-gram language model, as input. Eesen [14] introduced FST decoding for models that output character-based Connectionist Temporal Classification (CTC) [15] labels, similar to the Quartznet model of Scribosermo.…”
Section: Introduction
Mentioning, confidence: 99%
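The excerpt mentions extending the dialog manager to accept n-best ASR hypotheses combined with a 3-gram language model. A minimal sketch of that kind of n-best rescoring is shown below; the log-linear combination, the LM weight, and the toy trigram table are illustrative assumptions, not the cited paper's actual method.

```python
import math

def trigram_logprob(words, lm):
    """Sum log-probabilities from a hypothetical 3-gram table (dict keyed by
    word triples); unseen trigrams get a small floor probability."""
    padded = ["<s>", "<s>"] + words + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        tri = tuple(padded[i - 2:i + 1])
        total += math.log(lm.get(tri, 1e-6))
    return total

def rescore_nbest(nbest, lm, lm_weight=0.5):
    """nbest: list of (words, acoustic_logprob). Return hypotheses sorted by
    the combined score: acoustic + lm_weight * trigram score."""
    scored = [(words, ac + lm_weight * trigram_logprob(words, lm))
              for words, ac in nbest]
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Toy example: two hypotheses and a tiny trigram table.
lm = {("<s>", "<s>", "book"): 0.2, ("<s>", "book", "a"): 0.3,
      ("book", "a", "table"): 0.4, ("a", "table", "</s>"): 0.5}
nbest = [(["book", "a", "table"], -12.0), (["look", "a", "table"], -11.5)]
best_words, best_score = rescore_nbest(nbest, lm)[0]
print(best_words, best_score)
```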
“…Alexa also uses FSTs for its skill kit, but keeps separate models for STT and NLU [16]. This work follows a decoding approach very similar to Eesen's, which allows the use of recent CTC-based STT models (in contrast to [10,11,13]), but alters the Grammar-FST (explained in the following chapters) to embed NLU information into it, similar to the semantic tagging of [9,10,11], which allows the two distinct STT and NLU models to be combined into a single SLU decoder.…”
Section: Introduction
Mentioning, confidence: 99%
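The second excerpt describes embedding NLU information into the grammar FST so that a single decoding pass yields both the transcript and the semantic tags. The sketch below illustrates that idea with a toy deterministic transducer whose arcs emit "word|tag" outputs; the grammar, slot names, and API are assumptions for illustration only, not the cited system's implementation.

```python
from collections import defaultdict

class TaggingGrammarFST:
    """Toy grammar transducer: arcs map an input word to the next state and
    an output token that may carry an embedded slot tag."""

    def __init__(self):
        # arcs[state][word] -> (next_state, output_token)
        self.arcs = defaultdict(dict)

    def add_arc(self, state, word, next_state, tag=None):
        out = f"{word}|{tag}" if tag else word
        self.arcs[state][word] = (next_state, out)

    def transduce(self, words, start="0"):
        """Follow the (deterministic, toy) grammar; return the tagged output
        or None if the word sequence is not accepted."""
        state, out = start, []
        for w in words:
            if w not in self.arcs[state]:
                return None
            state, tok = self.arcs[state][w]
            out.append(tok)
        return out

# Hypothetical "turn on the light" grammar with a device slot.
g = TaggingGrammarFST()
g.add_arc("0", "turn", "1")
g.add_arc("1", "on", "2")
g.add_arc("2", "the", "3")
g.add_arc("3", "light", "4", tag="device")
print(g.transduce(["turn", "on", "the", "light"]))
# -> ['turn', 'on', 'the', 'light|device']
```

In a real SLU decoder this tagging grammar would be composed with the token and lexicon FSTs of the CTC decoding graph rather than applied to a word list, but the composed search produces the same kind of jointly tagged output.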