Abstract: Speech activity detection (SAD) is a component of many speech processing applications. Traditional SAD approaches use signal energy as the evidence for identifying speech regions; however, such methods perform poorly in uncontrolled environments. In this work, we propose a novel SAD approach that makes a multi-level decision using signal knowledge in an adaptive manner. The multi-level evidence considered comprises the modulation spectrum and the smoothed Hilbert envelope of the linear prediction (LP) residual. Modulation spectrum …
“…Clearly, a lower DCF indicates better classification performance. Two recent studies found DCF values of 7.4% (Sharma, Das, & Li, 2019) and 11.7% (Hansen, Joglekar, Shekhar, Kothapally, Yu, Kaushik, & Sangwan, 2019) (using a = 0.25 and b = 0.75) in S/VAD for recordings from the Apollo 11 space mission (Kaushik et al., 2018). Another study (Dubey, Sangwan, & Hansen, 2018) evaluated algorithms on a corpus of noisy recordings from degraded military communication channels and reported DCF values (also with a = 0.25, b = 0.75) ranging from 4.3% to 8.9% (mean = 6.1%) across five novel algorithms (averaged over degraded channel conditions), performance comparable to that of baseline algorithms (e.g., Sholokhov, Sahidullah, & Kinnunen, 2018).…”
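The DCF figures quoted above can be computed from frame-level decisions. A minimal sketch follows, assuming the common SAD convention in which the b weight (0.75) is applied to the miss rate and the a weight (0.25) to the false-alarm rate; the weight assignment and the toy labels are assumptions, not values from the cited studies.

```python
# Sketch of a frame-level Detection Cost Function (DCF) for S/VAD.
# Assumption: weight b applies to misses, weight a to false alarms.
def dcf(ref, hyp, a=0.25, b=0.75):
    """ref/hyp: per-frame labels, 1 = speech, 0 = non-speech."""
    speech = [h for r, h in zip(ref, hyp) if r == 1]
    nonspeech = [h for r, h in zip(ref, hyp) if r == 0]
    p_miss = speech.count(0) / len(speech) if speech else 0.0  # missed speech
    p_fa = nonspeech.count(1) / len(nonspeech) if nonspeech else 0.0  # false alarms
    return b * p_miss + a * p_fa

# Illustrative labels: one missed speech frame, one false alarm.
ref = [1, 1, 1, 1, 0, 0, 0, 0]
hyp = [1, 1, 0, 1, 0, 1, 0, 0]
print(dcf(ref, hyp))  # 0.25
```

With both error rates at 25%, the weighted sum also comes out to 25%; reported DCF values in the single digits therefore reflect much lower miss and false-alarm rates.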
Automatic speech processing devices have become popular for quantifying the amount of ambient language input to children in their home environments. We assessed error rates in language input estimates from the Language ENvironment Analysis (LENA) audio processing system, asking whether error rates differed as a function of adult talkers' gender and whether they were speaking to children or adults. Audio was sampled from within LENA recordings from 23 families with children aged 4-34 months. Human coders identified vocalizations by adults and children, counted intelligible words, and determined whether adults' speech was addressed to children or adults. LENA's classification accuracy was assessed by parceling audio into 100-ms frames and comparing, for each frame, human and LENA classifications. LENA correctly classified adult speech 67% of the time across families (average false negative rate: 33%). LENA's adult word count showed a mean +47% error relative to human counts. Classification and Adult Word Count error rates were significantly affected by talkers' gender and whether speech was addressed to a child or an adult. The largest systematic errors occurred when adult females addressed children. Results show that LENA's classifications and Adult Word Count entailed random, and sometimes large, errors across recordings, as well as systematic errors as a function of talker gender and addressee. Due to systematic and sometimes high error in estimates of the amount of adult language input, relying on this metric alone may lead to invalid clinical and/or research conclusions. Further validation studies and circumspect usage of LENA are warranted.
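The frame-level evaluation described above can be sketched as follows. This is an illustrative reconstruction, not LENA's actual output format: labels, helper names, and the toy data are assumptions, but the logic (compare human and system labels per 100-ms frame, then derive a correct-classification rate and false-negative rate for adult speech) follows the abstract.

```python
# Sketch: per-frame comparison of human vs. system speaker classifications.
# Each list element is the label for one 100-ms frame (illustrative labels).
def frame_error_rates(human, system, target="adult"):
    """Return (correct-classification rate, false-negative rate) for `target`."""
    target_frames = [(h, s) for h, s in zip(human, system) if h == target]
    hits = sum(1 for h, s in target_frames if s == target)
    recall = hits / len(target_frames) if target_frames else 0.0
    return recall, 1.0 - recall

human  = ["adult", "adult", "adult", "child", "other", "adult"]
system = ["adult", "other", "adult", "child", "other", "child"]
recall, fnr = frame_error_rates(human, system)
print(recall, fnr)  # 0.5 0.5
```

The reported 67% correct / 33% false-negative figures are exactly this pair of quantities, averaged across families.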
“…While this was similar to the 116 system submissions received for the FS-1 challenge, participation in both tracks of the SD and ASR tasks was noticeably higher. The systems developed for FS-2 also exhibited vast improvements in performance compared to the best systems developed for the FS-1 challenge [2,11,12,13,15], as seen in Table 5. We observed relative improvements of 67%, 57%, and 62% for the SAD, Speaker Diarization from scratch, and Speech Recognition from audio streams tasks, respectively.…”
[Figure 4: rVad-SincNet based SID baseline system [34,35]]
Section: Discussion
“…This began with the Inaugural FEARLESS STEPS Challenge: Massive Naturalistic Audio (FS-1). The first edition of this challenge encouraged the development of core unsupervised/semi-supervised speech and language systems for single-channel data with low resource availability, serving as the First Step towards extracting high-level information from such massive unlabeled corpora [11,12,13,14,15]. As a natural progression following the successful inaugural FS-1 challenge, the FEARLESS STEPS Challenge Phase-2 (FS-2) focuses on the development of single-channel supervised learning strategies.…”
The Fearless Steps Initiative by UTDallas-CRSS led to the digitization, recovery, and diarization of 19,000 hours of original analog audio data, as well as the development of algorithms to extract meaningful information from this multi-channel naturalistic data resource. The 2020 FEARLESS STEPS (FS-2) Challenge is the second annual challenge held for the Speech and Language Technology community to motivate supervised learning algorithm development for multi-party and multi-stream naturalistic audio. In this paper, we present an overview of the challenge sub-tasks, data, performance metrics, and lessons learned from Phase-2 of the Fearless Steps Challenge (FS-2). We present advancements made in FS-2 through extensive community outreach and feedback. We describe innovations in the challenge corpus development, and present revised baseline results. We finally discuss the challenge outcome and general trends in system development across both phases (Phase FS-1 Unsupervised, and Phase FS-2 Supervised) of the challenge, and its continuation into multi-channel challenge tasks for the upcoming Fearless Steps Challenge Phase-3.
“…Sound Localization and Classification (SLC) refers to estimating the spatial location of a sound source and identifying the type of a sound event through a unified framework. An SLC method enables autonomous robots to determine sound locations and detect sound events for navigation and interaction with their surroundings [1,2]. Thus, SLC is useful in smart-city and smart-home applications for automatically identifying social or human activities, and for assisting the hearing-impaired in visualizing and recognizing sounds [3,4,5,6].…”
In this work, we present the development of a new database, namely the Sound Localization and Classification (SLoClas) corpus, for studying and analyzing sound localization and classification. The corpus contains a total of 23.27 hours of data recorded using a 4-channel microphone array. 10 classes of sounds are played over a loudspeaker at a distance of 1.5 meters from the array, varying the Direction-of-Arrival (DoA) from 1° to 360° at intervals of 5°. To facilitate the study of noise robustness, 6 types of outdoor noise are recorded at 4 DoAs using the same devices. Moreover, we propose a baseline method, namely the Sound Localization and Classification Network (SLCnet), and present experimental results and analysis conducted on the collected SLoClas database. We achieve accuracies of 95.21% and 80.01% for sound localization and classification, respectively. We publicly release this database and the source code for research purposes.
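The DoA setup described above implies a fixed grid of candidate directions, over which localization accuracy can be scored. The sketch below is an assumption-laden illustration (grid construction and exact-match scoring are inferred from the abstract, and the names are hypothetical; the released corpus and code may organize this differently):

```python
# Sketch: DoA grid implied by "1 degree to 360 degrees at 5-degree intervals",
# and a simple exact-match localization accuracy over that grid.
doa_grid = list(range(1, 361, 5))  # 1, 6, 11, ..., 356 -> 72 directions

def localization_accuracy(true_doas, pred_doas):
    """Fraction of predictions that exactly match the true DoA (in degrees)."""
    correct = sum(1 for t, p in zip(true_doas, pred_doas) if t == p)
    return correct / len(true_doas)

# Illustrative: 3 of 4 directions predicted correctly.
print(len(doa_grid), localization_accuracy([1, 6, 11, 16], [1, 6, 356, 16]))  # 72 0.75
```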