We describe a statistical approach for modeling dialogue acts in conversational speech, i.e., speech-act-like units such as STATEMENT, QUESTION, BACKCHANNEL, AGREEMENT, DISAGREEMENT, and APOLOGY. Our model detects and predicts dialogue acts based on lexical, collocational, and prosodic cues, as well as on the discourse coherence of the dialogue act sequence. The dialogue model is based on treating the discourse structure of a conversation as a hidden Markov model and the individual dialogue acts as observations emanating from the model states. Constraints on the likely sequence of dialogue acts are modeled via a dialogue act n-gram. The statistical dialogue grammar is combined with word n-grams, decision trees, and neural networks modeling the idiosyncratic lexical and prosodic manifestations of each dialogue act. We develop a probabilistic integration of speech recognition with dialogue modeling, to improve both speech recognition and dialogue act classification accuracy. Models are trained and evaluated using a large hand-labeled database of 1,155 conversations from the Switchboard corpus of spontaneous human-to-human telephone speech. We achieved good dialogue act labeling accuracy (65% based on errorful, automatically recognized words and prosody, and 71% based on word transcripts, compared to a chance baseline accuracy of 35% and human accuracy of 84%) and a small reduction in word recognition error.
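The combination of a dialogue act n-gram with per-utterance evidence can be illustrated with a small Viterbi decoder. This is only a minimal sketch under stated assumptions, not the authors' implementation: it assumes one hidden state per dialogue act label, a bigram dialogue grammar, and that the lexical and prosodic models have already been reduced to a log-likelihood per utterance and label. The function name `viterbi_dialogue_acts` and all probabilities in the usage note are hypothetical.

```python
import math


def viterbi_dialogue_acts(log_likelihoods, log_transitions, log_initial):
    """Return the most likely dialogue act sequence for a conversation.

    log_likelihoods: list with one dict per utterance, act -> log P(evidence | act)
                     (assumed to come from lexical/prosodic models; hypothetical here)
    log_transitions: dict prev_act -> dict act -> log P(act | prev_act), a bigram grammar
    log_initial:     dict act -> log P(act) for the first utterance
    """
    acts = list(log_initial)
    # best[t][a]: best log score of any label sequence ending in act a at utterance t
    best = [{a: log_initial[a] + log_likelihoods[0][a] for a in acts}]
    back = []  # back[t-1][a]: best predecessor of act a at utterance t
    for evidence in log_likelihoods[1:]:
        scores, pointers = {}, {}
        for a in acts:
            prev = max(acts, key=lambda p: best[-1][p] + log_transitions[p][a])
            scores[a] = best[-1][prev] + log_transitions[prev][a] + evidence[a]
            pointers[a] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final act to recover the full sequence.
    path = [max(acts, key=lambda a: best[-1][a])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

In a toy conversation with two labels, a first utterance that looks like a QUESTION and a second utterance whose own evidence is ambiguous, the bigram grammar decides the second label: if `P(STATEMENT | QUESTION)` is high, the decoder labels the pair QUESTION, STATEMENT.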
A crucial step in processing speech audio data for information extraction, topic detection, or browsing/playback is to segment the input into sentence and topic units. Speech segmentation is challenging, since the cues typically present for segmenting text (headers, paragraphs, punctuation) are absent in spoken language. We investigate the use of prosody (information gleaned from the timing and melody of speech) for these tasks. Using decision tree and hidden Markov modeling techniques, we combine prosodic cues with word-based approaches, and evaluate performance on two speech corpora, Broadcast News and Switchboard. Results show that the prosodic model alone performs on par with, or better than, word-based statistical language models, for both true and automatically recognized words in news speech. The prosodic model achieves comparable performance with significantly less training data, and requires no hand-labeling of prosodic events. Across tasks and corpora, we obtain a significant improvement over word-only models using a probabilistic combination of prosodic and lexical information. Inspection reveals that the prosodic models capture language-independent boundary indicators described in the literature. Finally, cue usage is task and corpus dependent. For example, pause and pitch features are highly informative for segmenting news speech, whereas pause, duration and word-based cues dominate for natural conversation. © 2000 Elsevier Science B.V. All rights reserved.
Résumé (translated from French): A crucial step in processing speech for information extraction, conversation topic detection, and browsing is discourse segmentation. This is difficult because the cues that help segment a text (headers, paragraphs, punctuation) do not appear in spoken language. We study the use of prosody (information extracted from the rhythm and melody of speech) for this purpose. Using decision trees and hidden Markov chains, we combine prosodic cues with the language model. We evaluate our algorithm on two corpora, Broadcast News and Switchboard. Our results indicate that the prosodic model is equivalent or superior to the language model, and that it requires less training data. It requires no manual annotation of prosody. Moreover, we obtain a significant gain by probabilistically combining prosodic and lexical information, across different corpora and applications. A more detailed inspection of the results reveals that the prosodic models identify the segment-start and segment-end indicators described in the literature. Finally, the use of prosodic cues depends on the application and the corpus. For example, pitch proves extremely useful for segmenting television news broadcasts, whereas duration features and features extracted from the language model are more useful for segmenting natural conversation.
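One simple way to picture a probabilistic combination of prosodic and lexical information is a log-linear interpolation of two boundary posteriors at each word. The sketch below is an illustration under assumptions, not the combination method used in the paper: it assumes each model already outputs P(boundary) after every word, and the function names, the interpolation `weight`, and the `threshold` are all hypothetical.

```python
import math


def combine_boundary_posteriors(p_prosody, p_lexical, weight=0.5):
    """Log-linearly interpolate two P(boundary) estimates and renormalize.

    weight is the share given to the prosodic model (hypothetical parameter).
    """
    eps = 1e-12  # keep probabilities away from 0 and 1 so the logs stay finite
    p1 = min(max(p_prosody, eps), 1.0 - eps)
    p2 = min(max(p_lexical, eps), 1.0 - eps)
    log_yes = weight * math.log(p1) + (1.0 - weight) * math.log(p2)
    log_no = weight * math.log(1.0 - p1) + (1.0 - weight) * math.log(1.0 - p2)
    # Normalize over the two outcomes: boundary vs. no boundary.
    top = max(log_yes, log_no)
    yes = math.exp(log_yes - top)
    no = math.exp(log_no - top)
    return yes / (yes + no)


def segment(words, p_prosody, p_lexical, weight=0.5, threshold=0.5):
    """Split words into segments wherever the combined posterior exceeds threshold."""
    segments, current = [], []
    for word, pp, pl in zip(words, p_prosody, p_lexical):
        current.append(word)
        if combine_boundary_posteriors(pp, pl, weight) > threshold:
            segments.append(current)
            current = []
    if current:  # trailing words with no detected boundary form a final segment
        segments.append(current)
    return segments
```

With `weight=1.0` the decision relies on the prosodic posterior alone and with `weight=0.0` on the lexical one, so the same sketch can be used to compare single-model and combined segmentation on held-out data.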