“…The techniques used for experimentation with vocal synthesis include, e.g., VOSIM [20], Frequency Modulation (FM) [21], the Klatt filter model [22], Time-Domain Formant-Wave-Function Synthesis (FOF) [23], Phase-Aligned Formant Synthesis (PAF) [24], and Spectral Modeling followed by Additive Synthesis [25]. These algorithms have been evaluated with respect to computational cost, suitability of parameterization, and, of course, the resulting sound quality.…”
We present a prototype of a humanoid robot head equipped with human-like speech-sound localization and production systems, designed for a new generation of robots that should autonomously evolve language and other cognitive skills. Like the human auditory apparatus, the robot head contains a binaural sensor system based on a frequency-domain binaural model. This enables the robot to detect and locate a speaker autonomously from the speech signals the speaker produces. Humans, however, analyze the temporal regularity of incoming sounds on different time scales: periods in the millisecond range give rise to the sensation of pitch, while periods on the order of seconds give rise to the sensation of rhythm. In addition, unlike in human hearing, detecting and localizing multiple sound signals is a nontrivial problem for machine audition. We therefore discuss a possible implementation of human-like spatiotemporal processing of sounds in single- and multi-source scenarios. Our future goals are to combine the constructed speech synthesis and physical audio systems, and to establish an algorithm for detailed spatiotemporal localization of both single and concurrent speech sound sources, with roughly human-like temporal and spatial processing capabilities.
“…To produce a glottal pulse, the phase increment of the cosine is modulated to permit a local time-scale speedup or delay. This method, used in the VOSIM model presented by Kaegi and Tempelaars (1978), was suggested by Peter Pabon (1994). In the human voice, as well as in most musical instruments, an increase in sound level is associated with a decrease in spectral tilt. As a result, the higher partials gain more in amplitude during a crescendo than the lower partials.…”
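The phase-increment idea in the excerpt above can be sketched in a few lines. The following minimal example (the function name, the open-quotient and lobe-count parameters, and the raised-cosine shape are our own illustrative assumptions, not the cited VOSIM or Pabon implementations) advances a cosine's phase quickly during the open part of each glottal cycle and freezes it afterwards, so that each fundamental period contains one localized pulse:

```python
import numpy as np

def phase_modulated_pulse(sr=16000, f0=110.0, open_quotient=0.4, lobes=3.0):
    """One glottal cycle via local phase-increment modulation (sketch).

    During the open fraction of the period the phase advances fast enough
    to complete `lobes` full cosine cycles; afterwards the increment is
    zero, so the waveform stays at rest until the next period.
    """
    period = int(sr / f0)                     # samples per fundamental period
    n_open = int(open_quotient * period)      # samples in the "open" phase
    inc = np.zeros(period)
    inc[:n_open] = 2 * np.pi * lobes / n_open # fast phase advance (speedup)
    phase = np.cumsum(inc)                    # frozen phase in the closed part
    return 0.5 * (1.0 - np.cos(phase))        # raised-cosine pulse lobes
```

Concatenating such cycles yields a pulse train at the fundamental frequency; the excerpt's level-dependent spectral tilt would additionally require scaling the higher lobes more strongly as amplitude grows.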
Section: The Musse and Musse Dig Singing Synthesizers
“…VPM is similar to earlier techniques such as FOF [32] and VOSIM [33], in which the voice is modeled as a sequence of pulses whose timbre is roughly represented by a set of ideal resonances. In VPM, however, the timbre is represented by all of the harmonics, allowing it to capture subtle details and nuances of both the amplitude and the phase spectra.…”
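The representation the excerpt ascribes to VPM, one pulse per period with timbre carried by per-harmonic amplitudes and phases, amounts to additive resynthesis over a single period. A minimal sketch (the function name and signature are illustrative assumptions, not the paper's API):

```python
import numpy as np

def resynthesize_pulse(f0, amps, phases, sr=16000):
    """Rebuild one voice pulse from per-harmonic amplitude and phase.

    Harmonic k (1-based) contributes amps[k-1] * cos(2*pi*k*f0*t + phases[k-1]).
    Keeping the phases, not just the amplitudes, is what lets this kind of
    representation preserve the pulse's fine waveform shape.
    """
    period = int(sr / f0)
    t = np.arange(period) / sr
    pulse = np.zeros(period)
    for k, (a, ph) in enumerate(zip(amps, phases), start=1):
        pulse += a * np.cos(2 * np.pi * k * f0 * t + ph)
    return pulse
```

With a single harmonic of unit amplitude and zero phase, the result is one cycle of a cosine; richer amplitude/phase sets shape sharper, more voice-like pulses.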