Abstract: Speech activity detection (SAD) is a component of many speech processing applications. Traditional SAD approaches use signal energy as the evidence for identifying speech regions; however, such methods perform poorly in uncontrolled environments. In this work, we propose a novel SAD approach that makes a multi-level decision using signal knowledge in an adaptive manner. The multi-level evidence considered comprises the modulation spectrum and the smoothed Hilbert envelope of the linear prediction (LP) residual. Modulation spectrum …
“…Clearly, a lower DCF indicates better classification performance. Two recent studies found DCF values of 7.4% (Sharma, Das, & Li, 2019) and 11.7% (Hansen, Joglekar, Shekhar, Kothapally, Yu, Kaushik, & Sangwan, 2019) (using a = 0.25 and b = 0.75) in S/VAD for recordings from the Apollo 11 space mission (Kaushik et al., 2018). Another study (Dubey, Sangwan, & Hansen, 2018) evaluated algorithms on a corpus of noisy recordings from degraded military communication channels and reported DCF values (also with a = 0.25, b = 0.75) ranging from 4.3% to 8.9% (mean = 6.1%) across five novel algorithms (averaged over degraded channel conditions), performance comparable to that of baseline algorithms (e.g., Sholokhov, Sahidullah, & Kinnunen, 2018).…”
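The DCF figures quoted above can be computed from frame-level decisions. A minimal sketch follows, assuming the common SAD convention in which the b weight (0.75) is applied to the miss rate and the a weight (0.25) to the false-alarm rate; the weight assignment and the toy labels are assumptions, not values from the cited studies.

```python
# Sketch of a frame-level Detection Cost Function (DCF) for S/VAD.
# Assumption: weight b applies to misses, weight a to false alarms.
def dcf(ref, hyp, a=0.25, b=0.75):
    """ref/hyp: per-frame labels, 1 = speech, 0 = non-speech."""
    speech = [h for r, h in zip(ref, hyp) if r == 1]
    nonspeech = [h for r, h in zip(ref, hyp) if r == 0]
    p_miss = speech.count(0) / len(speech) if speech else 0.0  # missed speech
    p_fa = nonspeech.count(1) / len(nonspeech) if nonspeech else 0.0  # false alarms
    return b * p_miss + a * p_fa

# Illustrative labels: one missed speech frame, one false alarm.
ref = [1, 1, 1, 1, 0, 0, 0, 0]
hyp = [1, 1, 0, 1, 0, 1, 0, 0]
print(dcf(ref, hyp))  # 0.25
```

With both error rates at 25%, the weighted sum also comes out to 25%; reported DCF values in the single digits therefore reflect much lower miss and false-alarm rates.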
Automatic speech processing devices have become popular for quantifying the amount of ambient language input to children in their home environments. We assessed error rates in language input estimates from the Language ENvironment Analysis (LENA) audio processing system, asking whether error rates differed as a function of adult talkers' gender and whether they were speaking to children or adults. Audio was sampled from within LENA recordings from 23 families with children aged 4-34 months. Human coders identified vocalizations by adults and children, counted intelligible words, and determined whether adults' speech was addressed to children or adults. LENA's classification accuracy was assessed by parceling audio into 100-ms frames and comparing, for each frame, human and LENA classifications. LENA correctly classified adult speech 67% of the time across families (average false negative rate: 33%). LENA's adult word count showed a mean +47% error relative to human counts. Classification and Adult Word Count error rates were significantly affected by talkers' gender and whether speech was addressed to a child or an adult. The largest systematic errors occurred when adult females addressed children. Results show that LENA's classifications and Adult Word Count entailed random, and sometimes large, errors across recordings, as well as systematic errors as a function of talker gender and addressee. Due to systematic and sometimes high error in estimates of the amount of adult language input, relying on this metric alone may lead to invalid clinical and/or research conclusions. Further validation studies and circumspect usage of LENA are warranted.
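The frame-level evaluation described above can be sketched as follows. This is an illustrative reconstruction, not LENA's actual output format: labels, helper names, and the toy data are assumptions, but the logic (compare human and system labels per 100-ms frame, then derive a correct-classification rate and false-negative rate for adult speech) follows the abstract.

```python
# Sketch: per-frame comparison of human vs. system speaker classifications.
# Each list element is the label for one 100-ms frame (illustrative labels).
def frame_error_rates(human, system, target="adult"):
    """Return (correct-classification rate, false-negative rate) for `target`."""
    target_frames = [(h, s) for h, s in zip(human, system) if h == target]
    hits = sum(1 for h, s in target_frames if s == target)
    recall = hits / len(target_frames) if target_frames else 0.0
    return recall, 1.0 - recall

human  = ["adult", "adult", "adult", "child", "other", "adult"]
system = ["adult", "other", "adult", "child", "other", "child"]
recall, fnr = frame_error_rates(human, system)
print(recall, fnr)  # 0.5 0.5
```

The reported 67% correct / 33% false-negative figures are exactly this pair of quantities, averaged across families.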
“…While this was similar to the 116 system submissions received for the FS-1 challenge, participation in both tracks of the SD and ASR tasks was noticeably higher. The systems developed for FS-2 also exhibited vast improvements in performance compared to the best systems developed for the FS-1 challenge [2,11,12,13,15], as seen in Table 5. We observed relative improvements of 67%, 57%, and 62% for the SAD, Speaker Diarization from scratch, and Speech Recognition from audio streams tasks, respectively.…”
[Figure 4: rVad-SincNet based SID baseline system [34,35]]
Section: Discussion
“…This began with the Inaugural FEARLESS STEPS Challenge: Massive Naturalistic Audio (FS-1). The first edition of this challenge encouraged the development of core unsupervised/semi-supervised speech and language systems for single-channel data with low resource availability, serving as the First Step towards extracting high-level information from such massive unlabeled corpora [11,12,13,14,15]. As a natural progression following the successful inaugural FS-1 challenge, the FEARLESS STEPS Challenge Phase-2 (FS-2) focuses on the development of single-channel supervised learning strategies.…”
The Fearless Steps Initiative by UTDallas-CRSS led to the digitization, recovery, and diarization of 19,000 hours of original analog audio data, as well as the development of algorithms to extract meaningful information from this multi-channel naturalistic data resource. The 2020 FEARLESS STEPS (FS-2) Challenge is the second annual challenge held for the Speech and Language Technology community to motivate supervised learning algorithm development for multi-party and multi-stream naturalistic audio. In this paper, we present an overview of the challenge sub-tasks, data, performance metrics, and lessons learned from Phase-2 of the Fearless Steps Challenge (FS-2). We present advancements made in FS-2 through extensive community outreach and feedback. We describe innovations in the challenge corpus development, and present revised baseline results. We finally discuss the challenge outcome and general trends in system development across both phases (Phase FS-1 Unsupervised, and Phase FS-2 Supervised) of the challenge, and its continuation into multi-channel challenge tasks for the upcoming Fearless Steps Challenge Phase-3.
“…Sound Localization and Classification (SLC) refers to estimating the spatial location of a sound source and identifying the type of a sound event through a unified framework. An SLC method enables autonomous robots to determine sound locations and detect sound events for navigation and interaction with their surroundings [1,2]. Thus, SLC is useful in smart-city and smart-home applications for automatically identifying social or human activities, and for assisting the hearing-impaired in visualizing and recognizing sounds [3,4,5,6].…”
In this work, we present the development of a new database, namely the Sound Localization and Classification (SLoClas) corpus, for studying and analyzing sound localization and classification. The corpus contains a total of 23.27 hours of data recorded using a 4-channel microphone array. 10 classes of sounds are played over a loudspeaker at a distance of 1.5 meters from the array, varying the Direction-of-Arrival (DoA) from 1° to 360° at intervals of 5°. To facilitate the study of noise robustness, 6 types of outdoor noise are recorded at 4 DoAs using the same devices. Moreover, we propose a baseline method, namely the Sound Localization and Classification Network (SLCnet), and present experimental results and analysis conducted on the collected SLoClas database. We achieve accuracies of 95.21% and 80.01% for sound localization and classification, respectively. We publicly release this database and the source code for research purposes.
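The DoA setup described above implies a fixed grid of candidate directions, over which localization accuracy can be scored. The sketch below is an assumption-laden illustration (grid construction and exact-match scoring are inferred from the abstract, and the names are hypothetical; the released corpus and code may organize this differently):

```python
# Sketch: DoA grid implied by "1 degree to 360 degrees at 5-degree intervals",
# and a simple exact-match localization accuracy over that grid.
doa_grid = list(range(1, 361, 5))  # 1, 6, 11, ..., 356 -> 72 directions

def localization_accuracy(true_doas, pred_doas):
    """Fraction of predictions that exactly match the true DoA (in degrees)."""
    correct = sum(1 for t, p in zip(true_doas, pred_doas) if t == p)
    return correct / len(true_doas)

# Illustrative: 3 of 4 directions predicted correctly.
print(len(doa_grid), localization_accuracy([1, 6, 11, 16], [1, 6, 356, 16]))  # 72 0.75
```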