Work in the voice sciences over recent decades has led to a proliferation of acoustic parameters that are used rather selectively and are not always extracted in a similar fashion. With many independent teams working in different research areas, shared standards become an essential safeguard to ensure compliance with state-of-the-art methods, to allow appropriate comparison of results across studies, and to enable the integration and combination of extraction and recognition systems. In this paper we propose a basic standard acoustic parameter set for various areas of automatic voice analysis, such as paralinguistic or clinical speech analysis. In contrast to large brute-force parameter sets, we present a minimalistic set of voice parameters, selected on the basis of a) their potential to index affective physiological changes in voice production, b) their proven value in former studies as well as their automatic extractability, and c) their theoretical significance. The set is intended to provide a common baseline for the evaluation of future research and to eliminate differences caused by varying parameter sets or by different implementations of the same parameters. Our implementation is publicly available in the openSMILE toolkit. Comparative evaluations against the large baseline feature sets of the INTERSPEECH challenges show that the proposed set performs strongly relative to its size.
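A minimal sketch of how such a parameter set can be extracted in practice, assuming the openSMILE Python wrapper (the opensmile package) and a local WAV file; the file name and the particular feature-set version chosen here are placeholders, not prescribed by the paper:

```python
# Sketch: extracting GeMAPS functionals with the opensmile Python wrapper.
# Assumes `pip install opensmile` and a local file "speech_sample.wav" (placeholder name).
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.GeMAPSv01b,        # minimalistic GeMAPS parameter set
    feature_level=opensmile.FeatureLevel.Functionals,   # one summary vector per file
)

features = smile.process_file("speech_sample.wav")      # pandas DataFrame with one row
print(features.shape)
```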
Emotions can be recognized from audible paralinguistic cues in speech. By detecting such cues, which can consist of laughter, a trembling voice, coughs, changes in the intonation contour, and so on, information about the speaker's state and emotion can be revealed. This paper describes the development of a gender-independent laughter detector intended to enable automatic emotion recognition. Different types of features (spectral, prosodic) for laughter detection were investigated using classification techniques often used in language and speaker recognition (Gaussian Mixture Models, Support Vector Machines, Multi-Layer Perceptrons). Classification experiments were carried out on short, pre-segmented speech and laughter segments extracted from the ICSI Meeting Recorder Corpus (mean duration approximately 2 s). Equal error rates of around 3% were obtained when testing on speaker-independent data. We found that fusing classifiers based on Gaussian Mixture Models with classifiers based on Support Vector Machines increases discriminative power. We also found that fusing classifiers that use spectral features with classifiers that use prosodic information usually improves discrimination between laughter and speech. Our acoustic measurements showed differences between laughter and speech in mean pitch and in the ratio of unvoiced to voiced duration, which indicates that these prosodic features are indeed useful for discriminating laughter from speech.
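The score-level fusion idea can be illustrated with a short scikit-learn sketch; this is not the authors' pipeline (which used spectral and prosodic features from the ICSI corpus), and the synthetic arrays below merely stand in for real per-segment feature vectors:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

# Placeholder features: one row per segment (real systems would use spectral/prosodic vectors).
rng = np.random.default_rng(0)
X_laugh = rng.normal(1.0, 1.0, (200, 13))   # synthetic "laughter" training segments
X_speech = rng.normal(0.0, 1.0, (200, 13))  # synthetic "speech" training segments
X_test = rng.normal(0.5, 1.0, (50, 13))     # synthetic held-out segments

# One GMM per class; the log-likelihood ratio serves as the GMM score.
gmm_laugh = GaussianMixture(n_components=8, covariance_type="diag").fit(X_laugh)
gmm_speech = GaussianMixture(n_components=8, covariance_type="diag").fit(X_speech)
gmm_score = gmm_laugh.score_samples(X_test) - gmm_speech.score_samples(X_test)

# Discriminative SVM trained on the pooled data; decision_function gives a margin score.
X_train = np.vstack([X_laugh, X_speech])
y_train = np.r_[np.ones(len(X_laugh)), np.zeros(len(X_speech))]
svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)
svm_score = svm.decision_function(X_test)

# Simple late fusion: average the standardised scores (weights would normally be tuned on dev data).
fused = 0.5 * (gmm_score - gmm_score.mean()) / gmm_score.std() \
      + 0.5 * (svm_score - svm_score.mean()) / svm_score.std()
is_laughter = fused > 0
```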
One of the biggest challenges in designing computer-assisted language learning (CALL) applications that provide automatic feedback on pronunciation errors is reliably detecting those errors at a level of detail that is actually useful to learners. In our research we investigate pronunciation errors frequently made by foreigners learning Dutch as a second language. In the present paper we focus on the velar fricative /x/ and the velar plosive /k/. We compare four types of classifiers that can be used to detect erroneous pronunciations of these phones: two acoustic-phonetic classifiers (one of which employs linear discriminant analysis (LDA)), a classifier based on cepstral coefficients in combination with LDA, and one based on confidence measures (the so-called Goodness Of Pronunciation score). The best results were obtained for the two LDA classifiers, which produced accuracy levels of about 85-93%.
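As an illustration of the cepstral-plus-LDA approach, here is a small scikit-learn sketch; the feature vectors and labels are synthetic placeholders rather than the Dutch L2 data used in the paper:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Placeholder data: one MFCC-style vector per phone realisation,
# labelled 1 for a correct pronunciation and 0 for an error (synthetic here).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (300, 12)),   # synthetic mispronunciations
               rng.normal(0.8, 1.0, (300, 12))])  # synthetic correct realisations
y = np.r_[np.zeros(300), np.ones(300)]

lda = LinearDiscriminantAnalysis().fit(X, y)
accuracy = lda.score(X, y)       # in practice, evaluate on a held-out test set
decisions = lda.predict(X[:5])   # 1 = accepted, 0 = flagged as mispronounced
print(accuracy, decisions)
```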
We evaluate multimodal rule-based strategies for backchannel (BC) generation in face-to-face conversations. Such strategies can be used by artificial listeners to determine when to produce a BC in dialogs with human speakers. In this research, we consider features from the speaker's speech and gaze. We used six rule-based strategies to determine the placement of BCs. The BCs were performed by an intelligent virtual agent using nods and vocalizations. In a user perception experiment, participants were shown video fragments of a human speaker together with an artificial listener who produced BC behavior according to one of the strategies. Participants were asked to rate how likely they thought the BC behavior had been performed by a human listener. We found that the number, timing, and type of BCs had a significant effect on how human-like the BC behavior was perceived. The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 211486 (SEMAINE).
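One way a rule-based strategy of this kind can be operationalised is sketched below; the specific rule (a minimum pause after sufficient speech, with the speaker gazing at the listener), the feature names, and all thresholds are hypothetical and are not claimed to be one of the six strategies evaluated in the paper:

```python
from dataclasses import dataclass

@dataclass
class FrameFeatures:
    """Per-frame observations of the speaker (hypothetical feature names)."""
    t: float               # time in seconds
    is_speaking: bool      # voice activity of the speaker
    gaze_at_listener: bool # whether the speaker looks at the listener

def backchannel_times(frames, min_pause=0.4, min_speech_before=1.0):
    """Hypothetical rule: trigger a BC when the speaker falls silent for at least
    `min_pause` s after at least `min_speech_before` s of speech, while gazing
    at the listener."""
    triggers, speech_run, pause_run, prev_t = [], 0.0, 0.0, None
    for f in frames:
        dt = 0.0 if prev_t is None else f.t - prev_t
        prev_t = f.t
        if f.is_speaking:
            speech_run += dt
            pause_run = 0.0
        else:
            pause_run += dt
            if (pause_run >= min_pause and speech_run >= min_speech_before
                    and f.gaze_at_listener):
                triggers.append(f.t)
                speech_run = pause_run = 0.0  # at most one BC per pause
    return triggers

# Example: frames sampled every 100 ms, speech for 1.5 s followed by a pause.
frames = [FrameFeatures(t=i * 0.1, is_speaking=(i < 15), gaze_at_listener=True)
          for i in range(30)]
print(backchannel_times(frames))  # one trigger shortly after the pause begins
```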
We investigate the combination of several sources of information for the purpose of subjectivity recognition and polarity classification in meetings. We focus on features from two modalities, transcribed words and acoustics, and we compare the performance of three different textual representations: words, characters, and phonemes. Our experiments show that character-level features outperform word-level features for these tasks, and that a careful fusion of all features yields the best performance.
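A minimal sketch of word-level versus character-level text features with late fusion, using scikit-learn; the toy utterances are placeholders, and the acoustic and phoneme streams used in the paper are omitted here for brevity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus (placeholder utterances with subjectivity labels).
texts = ["that is a great idea", "the meeting starts at nine",
         "i really dislike this plan", "the report has ten pages"]
labels = [1, 0, 1, 0]  # 1 = subjective, 0 = objective

# Word-level vs character-level n-gram representations of the same transcripts.
word_clf = make_pipeline(TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),
                         LogisticRegression())
char_clf = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                         LogisticRegression())
word_clf.fit(texts, labels)
char_clf.fit(texts, labels)

# Late fusion: average the two posterior estimates for a new utterance.
probe = ["i love how this sounds"]
p = 0.5 * word_clf.predict_proba(probe)[0, 1] + 0.5 * char_clf.predict_proba(probe)[0, 1]
print("subjective" if p > 0.5 else "objective")
```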