Abstract: Vowel onset point (VOP) refers to the starting event of a vowel, which may be reflected in different aspects of the speech signal. The major issue in VOP detection using existing methods is the confusion between vowels and the other categories of sounds preceding them. This work explores the usefulness of sonority information to reduce this confusion and improve VOP detection. Vowels are the most sonorant sounds, followed by semivowels, nasals, voiced fricatives, and voiced stops. The sonority feature is derived from …
“…Using this manually marked starting label, we synchronize the source (loudspeaker) signal and the 4-channel recorded audio signals. Considering the start of the audio as an anchor point, we segment all the sample sounds with energy based evidence [27,28,29] and manual observation. In this way, we achieve 988 segmented audio files and a TSP signal for each DOA angle.…”
In this work, we present the development of a new database, the Sound Localization and Classification (SLoClas) corpus, for studying and analyzing sound localization and classification. The corpus contains a total of 23.27 hours of data recorded using a 4-channel microphone array. Ten classes of sounds are played over a loudspeaker at a distance of 1.5 meters from the array, with the Direction-of-Arrival (DoA) varied from 1° to 360° at an interval of 5°. To facilitate the study of noise robustness, 6 types of outdoor noise are recorded at 4 DoAs using the same devices. Moreover, we propose a baseline method, the Sound Localization and Classification Network (SLCnet), and present experimental results and analysis on the collected SLoClas database. We achieve accuracies of 95.21% and 80.01% for sound localization and classification, respectively. We publicly release this database and the source code for research purposes.
“…Similarly, entropy is considered as an evidence to detect the speech in noisy conditions [16]. The vowel-like regions belong to high SNR portion of speech signals and are less affected by noise [17][18][19]. Similarly, glottal activity detection and sonorant region detection are performed to identify the speech regions in a noisy scenario [20,21].…”
Speech activity detection (SAD) is a part of many speech processing applications. Traditional SAD approaches use signal energy as the evidence to identify speech regions. However, such methods perform poorly in uncontrolled environments. In this work, we propose a novel SAD approach that makes a multi-level decision using signal knowledge in an adaptive manner. The multi-level evidence considered is the modulation spectrum and the smoothed Hilbert envelope of the linear prediction (LP) residual. The modulation spectrum has compelling parallels to the dynamics of speech production and captures information only for the speech component. In contrast, the Hilbert envelope of the LP residual captures the excitation source aspect of speech. In uncontrolled scenarios, this evidence is found to be robust to signal distortions and is thus expected to work well. In view of the different levels of interference present in the signal, we propose to use a quality factor to control the speech/nonspeech decision in an adaptive manner. We refer to this method as multi-level adaptive SAD and evaluate it on the Fearless Steps corpus, which was collected during the Apollo-11 mission in naturalistic environments. We achieve a detection cost function of 7.35% with the proposed multi-level adaptive SAD on the evaluation set of the Fearless Steps 2019 challenge corpus.
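To make the energy-based baseline mentioned in this abstract concrete, the following is a minimal sketch of a short-time energy detector. It is not the multi-level adaptive method proposed by the authors. The function names (`frame_energy`, `energy_sad`), the 25 ms frame, the 10 ms hop, and the -30 dB relative threshold are all illustrative assumptions rather than values taken from the paper.

```python
import numpy as np


def frame_energy(signal, frame_len, hop):
    """Mean squared amplitude of each (possibly overlapping) frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    energies = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        energies[i] = np.mean(frame ** 2)
    return energies


def energy_sad(signal, sr, frame_ms=25.0, hop_ms=10.0, threshold_db=-30.0):
    """Flag a frame as speech when its energy is within `threshold_db`
    of the loudest frame. All default values are assumptions."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    energies = frame_energy(np.asarray(signal, dtype=float), frame_len, hop)
    energies_db = 10.0 * np.log10(energies + 1e-12)  # avoid log(0)
    return energies_db > (energies_db.max() + threshold_db)


# Synthetic check: low-level noise, then a 200 Hz tone, then noise again.
sr = 16000
rng = np.random.default_rng(0)
noise = lambda n: 0.001 * rng.standard_normal(n)
t = np.arange(sr // 2) / sr
tone = 0.5 * np.sin(2 * np.pi * 200 * t)
signal = np.concatenate([noise(sr // 2), tone, noise(sr // 2)])
decisions = energy_sad(signal, sr)
```

A fixed threshold relative to the loudest frame works on this clean synthetic signal, but it is the kind of decision rule the abstract describes as performing poorly in uncontrolled environments, where the noise level changes over time.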