2021
DOI: 10.1186/s13636-020-00194-0

Deep multiple instance learning for foreground speech localization in ambient audio from wearable devices

Abstract: Over the recent years, machine learning techniques have been employed to produce state-of-the-art results in several audio related tasks. The success of these approaches has been largely due to access to large amounts of open-source datasets and enhancement of computational resources. However, a shortcoming of these methods is that they often fail to generalize well to tasks from real life scenarios, due to domain mismatch. One such task is foreground speech detection from wearable audio devices. Several inter…

Citations: cited by 9 publications (4 citation statements)
References: 24 publications

“…This might in part be due to technical limitations of the used MS algorithm: Although the algorithm achieved high accuracies of more than 85% in prior studies (Lane et al, 2012; Rabbi et al, 2011), the algorithm’s accuracy in less controlled environments is probably lower, as indicated by the size of agreement with DRM and ESM in the current study. In the future, researchers will likely have access to more sophisticated algorithms—for example, first evidence suggests that algorithms based on a distinction of foreground versus background sound might outperform more traditional voice-detection algorithms (Hebbar et al, 2021).…”
Section: Discussion (citation type: mentioning)
confidence: 99%
“…In general, we expected the agreement between DRM and MS to be lower than the agreement between ESM and MS because of the greater time delay and increased memory biases of DRM compared with ESM and MS. We further expected DRM and ESM to agree more on face-to-face interactions than DRM and MS or ESM and MS because of a closer alignment of operationalizations (e.g., social interactions assessed in DRM and MS may include periods without conversation) and because of technical challenges of MS, such as accurately identifying speakers (e.g., the participant or a surrounding group of people) and filtering out background noise (Hebbar et al, 2021). Accordingly, we derived the following hypotheses:…”
Section: The Present Study (citation type: mentioning)
confidence: 99%
“…It is not uncommon for an EAR study to accrue hundreds of hours of audio data. These audio data then, at least for the moment, need to be listened to and behaviorally coded by human coders (see Dubey et al, 2016; Hebbar et al, 2021; Schindler et al, 2022, for recent proof-of-concept attempts to automate aspects of the coding). In that, then, the EAR, as a naturalistic observation method, is ultimately subject to at least some of the same challenges that lab-based observation is.…”
Section: Mobile Sensing - The Why (citation type: mentioning)
confidence: 99%
“…Multi-instance learning (MIL) was originally used for the field of hand-printed numerals identification [1] and drug activity prediction [2]. Instead of considering a series of individually labeled instances, MIL focuses on the labels of sets (or called bags) of instances and demonstrate strong capabilities in many areas [3], e.g., speech localization [4], entity classification [5], protein structure determination [6], biometric authentication system [7–10], human pose estimation [11], medical image analysis [12], understanding chest CT imaging of COVID-19 [13], and clinical outcome prediction of COVID-19 [14].…”
Section: Introduction (citation type: mentioning)
confidence: 99%
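
The last citation statement describes multiple instance learning as working from labels on bags of instances rather than labels on individual instances. A minimal sketch of that idea follows, assuming the standard MIL assumption (a bag is positive if at least one of its instances is positive) and max pooling over per-instance scores. It is not the method of Hebbar et al. (2021); the function names `bag_label` and `localize`, the threshold of 0.5, and the example scores are all hypothetical and chosen only for illustration.

```python
from typing import List


def bag_label(instance_scores: List[float], threshold: float = 0.5) -> int:
    """Label a bag from its per-instance scores.

    Assumes the standard MIL assumption: the bag is positive (1)
    if at least one instance score reaches the threshold, else negative (0).
    This is equivalent to max pooling followed by thresholding.
    """
    if not instance_scores:
        raise ValueError("a bag must contain at least one instance")
    return int(max(instance_scores) >= threshold)


def localize(instance_scores: List[float], threshold: float = 0.5) -> List[int]:
    """Return the indices of instances whose score reaches the threshold.

    In a localization setting, these would be the segments of a clip
    that the model flags as positive, even though training used only
    the bag-level label.
    """
    return [i for i, score in enumerate(instance_scores) if score >= threshold]


# Hypothetical example: one audio clip (bag) split into three segments (instances).
clip_scores = [0.1, 0.9, 0.2]
clip_is_positive = bag_label(clip_scores)
positive_segments = localize(clip_scores)
```

Max pooling is only one possible aggregation; mean or attention-based pooling are common alternatives when a single high-scoring instance should not decide the bag label on its own.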