2021
DOI: 10.1186/s13636-020-00194-0
Deep multiple instance learning for foreground speech localization in ambient audio from wearable devices

Abstract: Over recent years, machine learning techniques have been employed to produce state-of-the-art results in several audio-related tasks. The success of these approaches has been largely due to access to large open-source datasets and to enhanced computational resources. However, a shortcoming of these methods is that they often fail to generalize well to tasks from real-life scenarios, owing to domain mismatch. One such task is foreground speech detection from wearable audio devices. Several inter…
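The abstract is truncated here, so the authors' exact model is not visible; in general, though, multiple instance learning for audio treats a recording as a "bag" of short segments that carries only a recording-level label, and segment-level localization falls out of the segment scores. A minimal sketch of that idea using max pooling over segment logits (a hypothetical illustration under stated assumptions, not the paper's implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bag_score(segment_logits):
    """MIL max pooling: a bag (recording) is scored positive if any
    single segment looks like foreground speech."""
    return sigmoid(np.max(np.asarray(segment_logits)))

def localize(segment_logits, threshold=0.5):
    """Segment-level localization from the same logits: return the
    indices of segments whose foreground probability exceeds the
    threshold."""
    probs = sigmoid(np.asarray(segment_logits))
    return [i for i, p in enumerate(probs) if p > threshold]

# Toy recording: five one-second segments, with logits from a
# hypothetical segment encoder. Segments 1 and 3 contain foreground
# speech in this made-up example.
logits = np.array([-3.0, 2.0, -1.5, 4.0, -2.0])
print(round(float(bag_score(logits)), 3))  # recording-level probability
print(localize(logits))                    # indices of foreground segments → [1, 3]
```

Max pooling is only one choice of MIL aggregator; attention-based pooling is a common alternative that additionally learns per-segment weights.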

Cited by 11 publications (5 citation statements)
References 24 publications
“…This might in part be due to technical limitations of the used MS algorithm: Although the algorithm achieved high accuracies of more than 85% in prior studies (Lane et al, 2012; Rabbi et al, 2011), the algorithm’s accuracy in less controlled environments is probably lower, as indicated by the size of agreement with DRM and ESM in the current study. In the future, researchers will likely have access to more sophisticated algorithms—for example, first evidence suggests that algorithms based on a distinction of foreground versus background sound might outperform more traditional voice-detection algorithms (Hebbar et al, 2021).…”
Section: Discussion
confidence: 99%
“…In general, we expected the agreement between DRM and MS to be lower than the agreement between ESM and MS because of the greater time delay and increased memory biases of DRM compared with ESM and MS. We further expected DRM and ESM to agree more on face-to-face interactions than DRM and MS or ESM and MS because of a closer alignment of operationalizations (e.g., social interactions assessed in DRM and MS may include periods without conversation) and because of technical challenges of MS, such as accurately identifying speakers (e.g., the participant or a surrounding group of people) and filtering out background noise (Hebbar et al, 2021). Accordingly, we derived the following hypotheses:…”
Section: The Present Study
confidence: 99%
“…It is not uncommon for an EAR study to accrue hundreds of hours of audio data. These audio data then, at least for the moment, need to be listened to and behaviorally coded by human coders (see Dubey et al, 2016; Hebbar et al, 2021; Schindler et al, 2022, for recent proof-of-concept attempts to automate aspects of the coding). In that, then, the EAR, as a naturalistic observation method, is ultimately subject to at least some of the same challenges that lab-based observation is.…”
Section: Mobile Sensing - The Why
confidence: 99%
“…However, it is only a matter of time before automated behavioral codings for some types of variables will become a possibility. First attempts are already being made (Dubey et al, 2016; Hebbar et al, 2021; Schindler et al, 2022), and if their accuracy and robustness can be increased, this would considerably reduce the time and financial resources currently required for coding speech data.…”
Section: Addressing Biases Inserted By Automation
confidence: 99%