Speaker intonation adaptation for transforming text-to-speech synthesis speaker identity

Langarani, Mahsa Sadat Elyasi; Santen, Jan P. H. van

doi:10.1109/asru.2015.7404783

Cited by 2 publications

(2 citation statements)

References 37 publications

“…A pitch modulation algorithm was implemented drawing inspiration from Langarani et al [6]. A linear mapping between mean pitch and the attention level in place of GMM mapping was implemented.…”

Section: B Preparation Of Visual Feedback System (mentioning)

confidence: 99%

Automatic Speech-Gesture Mapping and Engagement Evaluation in Human Robot Interaction

Ghosh, Dhall, Singla

2019

2019 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)

A robot needs contextual awareness, effective speech production and complementing non-verbal gestures for successful communication in society. In this paper, we present our end-to-end system that tries to enhance the effectiveness of non-verbal gestures. For achieving this, we identified prominently used gestures in performances by TED speakers and mapped them to their corresponding speech context and modulated speech based upon the attention of the listener. The proposed method utilized Convolutional Pose Machine [4] to detect the human gesture. Dominant gestures of TED speakers were used for learning the gesture-to-speech mapping. The speeches by them were used for training the model. We also evaluated the engagement of the robot with people by conducting a social survey. The effectiveness of the performance was monitored by the robot and it self-improvised its speech pattern on the basis of the attention level of the audience, which was calculated using visual feedback from the camera. The effectiveness of interaction as well as the decisions made during improvisation was further evaluated based on the head-pose detection and interaction survey.
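The citation statement for this paper describes a linear mapping between mean pitch and attention level, used in place of a GMM mapping. A minimal sketch of that idea follows. The function names, the slope and intercept values, and the convention that unvoiced frames are stored as 0 are all illustrative assumptions; neither the direction nor the scale of the citing authors' mapping is stated here.

```python
def linear_map(attention, slope, intercept):
    """Target mean pitch (Hz) as a linear function of attention level.

    slope and intercept are hypothetical parameters; the source does not
    give their values or sign.
    """
    return slope * attention + intercept


def shift_contour_to_mean(f0, target_mean):
    """Shift the voiced frames of an F0 contour so their mean equals target_mean.

    Assumption: unvoiced frames are encoded as 0 and are left untouched.
    """
    voiced = [v for v in f0 if v > 0]
    if not voiced:
        return list(f0)
    offset = target_mean - sum(voiced) / len(voiced)
    return [v + offset if v > 0 else 0.0 for v in f0]


# Example: attention 0.5 with an assumed slope of -40 Hz and intercept of 180 Hz
target = linear_map(0.5, -40.0, 180.0)
shifted = shift_contour_to_mean([100.0, 0.0, 120.0], target)
```

A uniform shift is only one way to impose a new mean pitch; scaling the contour would be another choice.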

“…In contrast to Anumanchipalli et al [4,12], the phonological unit used in DRIFT is the foot. In [13], we proposed a new intonation adaptation method using the DRIFT to transform the perceived identity of a TTS system to that of a target speaker with a small amount of training data.…”

Section: Introduction (mentioning)

confidence: 99%

Foot-based intonation for text-to-speech synthesis using neural networks

Langarani, Santen

2016

Speech Prosody 2016

Self Cite

We propose a method ("FONN") for F0 contour generation for text-to-speech synthesis. Training speech is automatically segmented into left-headed feet, annotated with syllable start/end times, foot position in the sentence, and the number of syllables in the foot. During training, we fit a superpositional intonation model comprising accent curves associated with feet and phrase curves. We propose to use a neural network for model parameter estimation. We tested the method against the HMM-based Speech Synthesis System (HTS) as well as against a template based variant of FONN ("DRIFT") by imposing contours generated by the methods onto natural speech and obtaining quality ratings. Test sets varied in degree of coverage by training data. Contours generated by DRIFT and FONN were strongly preferred over HTS-generated contours, especially for poorly-covered test items, with DRIFT slightly preferred over FONN. We conclude that the new methods hold promise for high-quality F0 contour generation while making efficient use of training data.
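The superpositional model in this abstract combines a phrase curve with accent curves associated with feet. A rough sketch of that additive structure follows. The curve shapes are assumptions chosen for illustration (a linearly declining phrase curve and a raised-cosine accent bump per foot); the abstract does not specify the shapes used in FONN or DRIFT, and all function and parameter names are hypothetical.

```python
import math


def phrase_curve(t, duration, start_hz, end_hz):
    """Assumed phrase curve: linear interpolation from start_hz to end_hz."""
    return start_hz + (t / duration) * (end_hz - start_hz)


def accent_curve(t, foot_start, foot_end, amplitude_hz):
    """Assumed accent curve: raised-cosine bump spanning one foot, zero elsewhere."""
    if not (foot_start <= t <= foot_end):
        return 0.0
    phase = (t - foot_start) / (foot_end - foot_start)
    return amplitude_hz * 0.5 * (1.0 - math.cos(2.0 * math.pi * phase))


def f0_contour(times, duration, start_hz, end_hz, feet):
    """Superpose the phrase curve and one accent curve per foot.

    feet is a list of (foot_start, foot_end, amplitude_hz) tuples.
    """
    return [
        phrase_curve(t, duration, start_hz, end_hz)
        + sum(accent_curve(t, s, e, a) for s, e, a in feet)
        for t in times
    ]


# Example: a 1-second phrase with a single foot covering the first half
contour = f0_contour([0.0, 0.25, 1.0], 1.0, 120.0, 100.0, [(0.0, 0.5, 20.0)])
```

In a system of the kind the abstract describes, the per-foot parameters would come from a trained estimator rather than being set by hand as they are here.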
