Anjali Menon scite author profile

Synthetic speech has been widely used in the study of speech cues. A serious disadvantage of this method is that it requires prior knowledge about the cues to be identified in order to synthesize the speech. Incomplete or inaccurate hypotheses about the cues often lead to speech sounds of low quality. In this research a psychoacoustic method, named three-dimensional deep search ͑3DDS͒, is developed to explore the perceptual cues of stop consonants from naturally produced speech. For a given sound, it measures the contribution of each subcomponent to perception by time truncating, highpass/lowpass filtering, or masking the speech with white noise. The AI-gram, a visualization tool that simulates the auditory peripheral processing, is used to predict the audible components of the speech sound. The results are generally in agreement with the classical studies that stops are characterized by a short duration burst followed by a F2 transition, suggesting the effectiveness of the 3DDS method. However, it is also shown that /ba/ and /pa/ may have a wide band click as the dominant cue. F2 transition is not necessary for the perception of /ta/ and /ka/. Moreover, many stop consonants contain conflicting cues that are characteristic of competing sounds. The robustness of a consonant sound to noise is determined by the intensity of the dominant cue.

show abstract

A psychoacoustic method for studying the necessary and sufficient perceptual cues of American English fricative consonants in noise

Trevino

Menon

et al. 2012

View full text Add to dashboard Cite

In a previous study on plosives, the 3-Dimensional Deep Search (3DDS) method for the exploration of the necessary and sufficient cues for speech perception was introduced (Li et al., (2010). J. Acoust. Soc. Am. 127(4), 2599-2610). Here, this method is used to isolate the spectral cue regions for perception of the American English fricatives /∫, 3, s, z, f, v, θ, δ in time, frequency, and intensity. The fricatives are analyzed in the context of consonant-vowel utterances, using the vowel /α/. The necessary cues were found to be contained in the frication noise for /∫, 3, s, z, f, v/. 3DDS analysis isolated the cue regions of /s, z/ between 3.6 and 8 [kHz] and /∫, 3/ between 1.4 and 4.2 [kHz]. Some utterances were found to contain acoustic components that were unnecessary for correct perception, but caused listeners to hear non-target consonants when the primary cue region was removed; such acoustic components are labeled "conflicting cue regions." The amplitude modulation of the high-frequency frication region by the fundamental F0 was found to be a sufficient cue for voicing. Overall, the 3DDS method allows one to analyze the effects of natural speech components without initial assumptions about where perceptual cues lie in time-frequency space or which elements of production they correspond to.

show abstract

DPLM: A Deep Perceptual Spatial-Audio Localization Metric

Manocha

Kumar

et al. 2021

View full text Add to dashboard Cite

Subjective evaluations are critical for assessing the perceptual realism of sounds in audio-synthesis driven technologies like augmented and virtual reality. However, they are challenging to set up, fatiguing for users, and expensive. In this work, we tackle the problem of capturing the perceptual characteristics of localizing sounds. Specifically, we propose a framework for building a generalpurpose quality metric to assess spatial localization differences between two binaural recordings. We model localization similarity by utilizing activation-level distances from deep networks trained for direction of arrival (DOA) estimation. Our proposed metric (DPLM) outperforms baseline metrics on correlation with subjective ratings on a diverse set of datasets, even without the benefit of any humanlabeled training data.

show abstract

Sound Source Separation Using Phase Difference and Reliable Mask Selection Selection

Kim

Menon

Bacchiani

et al. 2018

View full text Add to dashboard Cite

Audio Signal Processing for Telepresence Based on Wearable Array in Noisy and Dynamic Scenes

Beit-On

Lugasi

Madmoni

et al. 2022

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Anjali Menon

A psychoacoustic method to find the perceptual cues of stop consonants in natural speech

A psychoacoustic method for studying the necessary and sufficient perceptual cues of American English fricative consonants in noise

DPLM: A Deep Perceptual Spatial-Audio Localization Metric

Sound Source Separation Using Phase Difference and Reliable Mask Selection Selection

Audio Signal Processing for Telepresence Based on Wearable Array in Noisy and Dynamic Scenes

Contact Info

Product

Resources

About